Feature Overview
AdvantEDGE provides a built-in Monitoring Subsystem that integrates with scenarios.
This feature provides the following capabilities:
- Scenario local measurements
  - Automated Network Characteristics: Latency, UL/DL throughput and UL/DL packet loss are automatically recorded
  - Automated Events: Scenario events generated towards the Events API are recorded; recorded events can originate from the frontend, an external source, a replay file or one of the automations.
- Custom measurements
  - Custom metrics: The InfluxDB API is available for logging your own time-series metrics; you just need to include an InfluxDB client in your application and start logging.
- Dashboard visualization and management interface
  - Built-in network characteristics dashboards: visualize point-to-point (source to dest.) or aggregated (source to all) network metrics
  - Built-in wireless metrics dashboards: visualize wireless metric KPIs (RSRP, RSRQ, RSSI & PoA distance)
  - Custom dashboards: create your own dashboards that display automated measurements (net.char/events) alongside your own measurements.
- Metrics API
  - Expose metrics to applications: Metrics can be exposed to external applications for conducting network-adaptive experiments.
- Platform metrics local monitoring
  - Automated Platform Micro-Services monitoring: Prometheus collects metrics locally about the platform micro-services; this allows monitoring AdvantEDGE platform usage in your deployments.
- Metrics Long-term Storage (Optional)
  - Long-term data retention: Thanos pushes Prometheus metrics to object store every 2 hours
  - Daily backups: cronjob pushes InfluxDB data to object store
Micro-Services
- InfluxDB: Time-Series database - used to monitor scenario network characteristics, events & custom user metrics.
- Grafana: Dashboard visualization and management solution
- metrics-engine: Collects automated measurements and implements the metrics API
- Prometheus: Collects platform micro-services metrics
- Thanos: Prometheus extension for long-term metrics storage (disabled by default)
Scenario Configuration
No scenario configuration
Scenario Runtime
InfluxDB
InfluxDB is a time-series database; it provides a central aggregation point to store AdvantEDGE metrics.
Out-of-the-box collected metrics are:
- Latency
- UL/DL Throughput
- UL/DL Packet loss
- Events
InfluxDB runs as a dependency pod in the platform. When deploying a scenario, AdvantEDGE creates a database for the scenario and stores the aforementioned metrics in it. After scenario termination, the stored metrics remain available until the scenario is re-deployed. Users who want to preserve metrics must export them between scenario runs.
InfluxDB is provided as a platform facility; if desired, users can use the InfluxDB database instance to store demo-specific metrics & re-use them for graphing.
From outside the platform, access to InfluxDB is proxied through Grafana.
If required, AdvantEDGE can be configured to perform nightly backups of the entire InfluxDB database to an object store.
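As an illustration of logging demo-specific metrics to InfluxDB, the sketch below builds a point in InfluxDB line protocol and posts it over HTTP. It assumes an InfluxDB 1.x-style /write endpoint; the measurement, tag and field names, as well as any base URL or database name passed to write_point, are hypothetical and must be adapted to your deployment. A maintained InfluxDB client library would normally be preferable to hand-built requests.

```python
import urllib.parse
import urllib.request


def _escape(text, chars):
    """Backslash-escape each character in `chars` within `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render a field value using line-protocol type rules."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"           # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'            # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tag=value field=value [timestamp]."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):         # emit tags in sorted key order
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(val) for key, val in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


def write_point(base_url, database, line):
    """POST one line to the /write endpoint (assumed InfluxDB 1.x; network I/O)."""
    url = f"{base_url}/write?" + urllib.parse.urlencode({"db": database})
    request = urllib.request.Request(url, data=line.encode("utf-8"), method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call such as to_line_protocol("frame_rate", {"ue": "ue1"}, {"fps": 29.5}) yields a line that can be handed to write_point together with the InfluxDB base URL and the target database name.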
Grafana
Grafana is a flexible graphing service that can pull metrics directly from known data sources such as InfluxDB or Prometheus.
Grafana integrates with AdvantEDGE by providing dashboards that are embedded in the AdvantEDGE frontend. On platform bring-up, default AdvantEDGE dashboards are imported into the platform.
Grafana is provided as a platform facility; if desired, users can use Grafana to create and store demo-specific dashboards. Grafana provides a frontend that can be accessed from the Monitoring page; using the Grafana frontend, demo-specific dashboards can be added to the Monitoring page or the execution page.
Metrics engine
AdvantEDGE provides a /metrics endpoint in its REST API that allows users to collect and use metrics from their scenario control software or to experiment from their edge applications.
The service currently allows querying and subscribing to metrics related to:
- network KPIs (latency, UL/DL throughput, jitter, packet-loss)
- events received on the /events endpoint (mobility, net.char. update, etc.)
- HTTP requests received by the various REST APIs of the platform
Example usage of this API: in a past demo, we subscribed to this API to feed scenario data (throughput usage) into an ML algorithm of ours.
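To give a sense of how an application might act on throughput metrics once they have been retrieved, the sketch below smooths a series of samples and picks a video bitrate from them. It does not call the metrics API: the sample values, the bitrate ladder and the headroom factor are all hypothetical, and the smoothing technique (an exponentially weighted moving average) is chosen for illustration only, not taken from the demo mentioned above.

```python
def ewma(samples, alpha=0.3):
    """Exponentially weighted moving average of a sequence of numbers."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    average = None
    for value in samples:
        # The first sample seeds the average; later samples are blended in.
        average = value if average is None else alpha * value + (1 - alpha) * average
    return average


def pick_bitrate(throughput_mbps, ladder=(1.0, 2.5, 5.0, 8.0), headroom=0.8):
    """Return the highest bitrate in `ladder` that fits within a share of throughput."""
    budget = throughput_mbps * headroom
    usable = [rate for rate in ladder if rate <= budget]
    # Fall back to the lowest rung when even that exceeds the budget.
    return max(usable) if usable else min(ladder)


# Hypothetical downlink throughput samples in Mbps, oldest first.
recent_throughput = [10.0, 6.0, 4.0]
smoothed = ewma(recent_throughput, alpha=0.5)
bitrate = pick_bitrate(smoothed)
```

A larger alpha makes the application react faster to throughput changes at the cost of more bitrate switching.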
Prometheus
Prometheus is a monitoring & alerting toolkit that collects and stores metrics; its 2 main components are:
- Prometheus Server: Scrapes metrics from services and stores them in a time-series database; monitors alert conditions
- Alert Manager: Manages and publishes alert notifications
Prometheus metrics are stored in a database as time-series uniquely identified by metric name and applied labels. Supported metric types are:
- Counter: Single increasing numerical value
- Gauge: Single numerical value that can go up or down
- Histogram: Observations (durations, sizes, etc.) grouped into configurable buckets; includes counters for sample number & sum
- Summary: Observations (durations, sizes, etc.) with calculated quantiles; includes counters for sample number & sum
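The histogram type is the least intuitive of the four, because its buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and a final +Inf bucket counts them all. The toy model below illustrates that behaviour in plain Python; it is a teaching sketch with hypothetical class and method names, not the Prometheus client library.

```python
import bisect
import math


class Histogram:
    """Toy model of a Prometheus histogram: cumulative buckets plus sum and count."""

    def __init__(self, upper_bounds):
        # Upper bounds are kept sorted; +Inf is always the last bucket.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self._per_bucket = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Locate the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self._per_bucket[index] += 1
        self.sum += value
        self.count += 1

    def cumulative_buckets(self):
        """Return (upper_bound, cumulative_count) pairs, smallest bound first."""
        total = 0
        result = []
        for bound, hits in zip(self.upper_bounds, self._per_bucket):
            total += hits
            result.append((bound, total))
        return result


# Request durations in seconds, observed against three bucket bounds.
durations = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.1, 0.3, 2.0):
    durations.observe(seconds)
```

A summary differs in that quantiles are computed by the instrumented service itself, whereas histogram quantiles are estimated at query time from the bucket counts.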
Prometheus is best used for metrics collection; by grouping data into metric types, Prometheus efficiently supports data storage, queries & alerting. It is an excellent tool for monitoring platform or system usage trends over time.
NOTE: InfluxDB is better suited for event logging and long-term data storage.
Prometheus Server
Prometheus server pulls metrics from configured services by periodically scraping the well-known /metrics endpoint. At each scrape interval, it collects samples from each configured service and stores them in the appropriate time-series.
Services wishing to provide metrics to the Prometheus server must expose the /metrics endpoint and create a custom ServiceMonitor resource. There are several readily available Prometheus exporters and libraries to easily instrument microservices for metrics exposure.
Prometheus exposes its data through the PromQL query language, which allows retrieving and aggregating time-series data in real time. Queries can be made using the HTTP API; Grafana uses this API and supports Prometheus as a data source for graphing data in its dashboards.
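As a rough illustration of a PromQL query made over the HTTP API, the sketch below builds an instant-query URL for the /api/v1/query endpoint and parses the JSON result. The base URL and the metric name used in the usage note are hypothetical; how Prometheus is reachable depends on your deployment.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url, promql, time=None):
    """Build the URL for the Prometheus instant-query endpoint, /api/v1/query."""
    params = {"query": promql}
    if time is not None:
        params["time"] = str(time)   # optional evaluation timestamp
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode(params)


def parse_vector(payload):
    """Extract (labels, float value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {labels}, "value": [unix_ts, "value"]}.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def run_query(base_url, promql):
    """Execute the query over HTTP and return parsed samples (network I/O)."""
    with urllib.request.urlopen(instant_query_url(base_url, promql)) as response:
        return parse_vector(json.load(response))
```

For example, run_query("http://prometheus.example:9090", "rate(http_requests_total[5m])") would return the per-second request rate for each matching time-series, assuming a counter named http_requests_total is being scraped.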
Prometheus server also monitors its configured alert thresholds, informing the Alert Manager of any alert conditions.
Alert Manager
Alert Manager processes alerts received from Prometheus server. When an alert is received, the Alert Manager sends an alert notification to its configured listeners via e-mail, chat or notification systems.
Alert Manager also supports alert silencing and aggregation.
Thanos
Thanos is an open-source, highly available Prometheus setup with long-term storage capabilities. It runs as a sidecar in Prometheus pods and pushes data to its provisioned object store for long-term data retention.
Thanos components used in AdvantEDGE include:
- Sidecar: Implements the common gRPC StoreAPI & uploads metrics to object store
- Store Gateway: Implements the StoreAPI for historical data in an object storage bucket
- Query: Implements the Prometheus HTTP API by gathering data from underlying StoreAPIs in a Thanos cluster
- Compactor: Compacts data blocks & performs data downsampling
Thanos must be configured & deployed with an object store where long-term data is stored.
NOTE: Thanos is disabled by default in the AdvantEDGE deployment configuration.