Monitoring Subsystem

Feature Overview

AdvantEDGE provides a built-in Monitoring Subsystem that integrates with scenarios.

This feature provides the following capabilities:

Scenario local measurements
- Automated Network Characteristics: Latency, UL/DL throughput, UL/DL packet loss are automatically recorded
- Automated Events: Scenario events generated towards the Events API are recorded; recorded events can originate from the frontend, from an external source, from a replay file or from one of the automation features.
Custom measurements
- Custom metrics: the InfluxDB API is available for logging your own time-series metrics; you just need to include an InfluxDB client in your application and start logging.
Dashboard visualization and management interface
- Built-in network characteristics dashboards: visualize point-to-point (source to dest.) or aggregated (source to all) network metrics
- Built-in wireless metrics dashboards: visualize wireless metric KPIs (RSRP, RSRQ, RSSI & PoA distance)
- Custom dashboards: create your own dashboards that display automated measurements (net.char/events) alongside your own measurements.
Metrics API
- Expose metrics to applications: Metrics can be exposed to external applications for conducting network-adaptive experiments.
Platform metrics local monitoring
- Automated Platform Micro-Services monitoring: Prometheus collects metrics locally about the platform micro-services; this lets you monitor AdvantEDGE platform usage in your deployments.
Metrics Long-term Storage (Optional)
- Long-term data retention: Thanos pushes Prometheus metrics to an object store every 2 hours
- Daily backups: a cronjob pushes InfluxDB data to an object store

Micro-Services

InfluxDB: Time-Series database - used to monitor scenario network characteristics, events & custom user metrics.
Grafana: Dashboard visualization and management solution
metrics-engine: Collects automated measurements and implements the metrics API
Prometheus: Collects platform micro-services metrics
Thanos: Prometheus extension for long-term metrics storage (disabled by default)

Scenario Configuration

The Monitoring Subsystem requires no scenario configuration.

Scenario Runtime

InfluxDB

InfluxDB is a time-series database; it provides a central aggregation point to store AdvantEDGE metrics.

Out-of-the-box collected metrics are:

Latency
UL/DL Throughput
UL/DL Packet loss
Events

InfluxDB runs as a dependency pod in the platform. When deploying a scenario, AdvantEDGE creates a database for the scenario and stores the aforementioned metrics in it. After scenario termination, the stored metrics remain available until the scenario is re-deployed. Users who want to preserve metrics must export them between scenario runs.

InfluxDB is provided as a platform facility; if desired, users can use the InfluxDB database instance to store demo-specific metrics and re-use them for graphing.
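As a rough illustration of what logging a custom metric involves, the sketch below builds one point in InfluxDB line protocol, the text format InfluxDB accepts for writes. It uses only the Python standard library. The measurement name `demo_metrics`, the tag `app` and the field names are hypothetical, and the helper functions are not part of AdvantEDGE or of any InfluxDB client; in practice an InfluxDB client library would normally handle both the formatting and the write.

```python
def _escape(text, chars):
    """Backslash-escape each character of `chars` that appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Format one field value following line-protocol typing rules."""
    if isinstance(value, bool):      # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"           # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslashes and quotes are escaped.
    return '"' + _escape(str(value), '\\"') + '"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value field=value timestamp."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical demo metric: names and values are illustrative only.
line = to_line_protocol(
    "demo_metrics",
    {"app": "video-client"},
    {"frame_rate": 29.97, "dropped_frames": 3},
    1700000000000000000,
)
print(line)
```

Tags are indexed and suit values you filter or group by (such as an application name), while fields hold the measured values themselves.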

From outside the platform, access to InfluxDB is proxied through Grafana.

If required, AdvantEDGE can be configured to perform nightly backups of the entire InfluxDB database to an object store.

Grafana

Grafana is a flexible graphing service that can pull metrics directly from known data sources such as InfluxDB or Prometheus.

Grafana integrates with AdvantEDGE by providing dashboards that are embedded in the AdvantEDGE frontend. On platform bring-up, default AdvantEDGE dashboards are imported into the platform.

Grafana is provided as a platform facility; if desired, users can use Grafana to create and store demo-specific dashboards. Grafana provides a frontend that can be accessed from the Monitoring page; using the Grafana frontend, demo-specific dashboards can be added to the Monitoring page or the execution page.

Metrics engine

AdvantEDGE provides a /metrics endpoint in its REST API that allows users to collect and use metrics from their scenario control software, or to experiment with them from their edge applications.

The service currently supports querying and subscribing to metrics related to:

network KPIs (latency, UL/DL throughput, jitter, packet-loss)
events received on the /events endpoint (mobility, net.char. update, etc.)
HTTP requests received by the various REST APIs of the platform

Example usage of this API: in a past demo, we subscribed to this API to feed scenario data (throughput usage) into one of our ML algorithms.

Prometheus

Prometheus is a monitoring & alerting toolkit that collects and stores metrics; its two main components are:

Prometheus Server: Scrapes metrics from services and stores them in a time-series database; monitors alert conditions
Alert Manager: Manages and publishes alert notifications

Prometheus metrics are stored in a database as time series, each uniquely identified by its metric name and applied labels. Supported metric types are:

Counter: Single monotonically increasing numerical value
Gauge: Single numerical value that can go up or down
Histogram: Observations (durations, sizes, etc.) grouped into configurable buckets; includes counters for sample number & sum
Summary: Observations (durations, sizes, etc.) with calculated quantiles; includes counters for sample number & sum
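To make the difference between these types concrete, here is a minimal sketch of the data model behind a counter, a gauge and a histogram, written in plain Python. It is not how Prometheus client libraries are implemented; the class names and the bucket bounds are chosen for illustration only, and Summary is left out because quantile estimation needs more machinery than fits here.

```python
import math


class Counter:
    """Single value that only increases."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Single value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Observations counted into cumulative buckets, plus a count and a sum."""

    def __init__(self, upper_bounds):
        # Always end with +Inf so every observation lands in at least one bucket.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        self.count += 1
        self.sum += value
        # Buckets are cumulative: increment every bucket whose bound is >= value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1


# Hypothetical bucket bounds (seconds) and samples, for illustration only.
latency = Histogram([0.01, 0.05, 0.1])
for sample in (0.004, 0.03, 0.03, 0.2):
    latency.observe(sample)
print(latency.bucket_counts, latency.count)
```

Because buckets are cumulative, the last (+Inf) bucket always equals the total sample count, which is why a histogram can report both a distribution and an average (sum divided by count) from the same series.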

Prometheus is best used for metrics collection; by grouping data into metric types, Prometheus efficiently supports data storage, queries & alerting. It is an excellent tool for monitoring platform or system usage trends over time.

NOTE: InfluxDB is better suited for event logging and long-term data storage.

Prometheus Server

Prometheus server pulls metrics from configured services by periodically scraping the well-known /metrics endpoint. At each scrape interval, it collects samples from each configured service and stores them in the appropriate time series.

Services wishing to provide metrics to the Prometheus server must expose the /metrics endpoint and create a custom ServiceMonitor resource. There are several readily available Prometheus exporters and libraries to easily instrument microservices for metrics exposure.
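For a sense of what a scrape returns, the sketch below renders one metric family in the Prometheus text exposition format and serves it from a /metrics path using only the Python standard library. The metric name `demo_requests_total`, its labels, its values and the port are hypothetical; a real service would more likely rely on an existing exporter or client library, and the ServiceMonitor resource is not shown here.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    assumed not to contain characters that need escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(name + "{" + pairs + "} " + str(value))
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric, for illustration only.
text = render_metric(
    "demo_requests_total",
    "counter",
    "Total requests handled.",
    [({"method": "GET"}, 12), ({"method": "POST"}, 3)],
)


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with the rendered text; anything else is a 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = text.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the endpoint; the port number is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(text)
```

Calling `serve()` starts the endpoint; it is left uncalled here so the snippet can be run without blocking.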

Prometheus exposes its data through the PromQL query language, which allows retrieving and aggregating time-series data in real time. Queries can be made using the HTTP API; Grafana uses this API and supports Prometheus as a data source for graphing data in its dashboards.
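As an illustration of querying through the HTTP API, the following sketch builds an instant-query URL for a PromQL expression and parses the JSON response shape that Prometheus returns for an instant vector. The base URL `http://prometheus.example:9090`, the metric name `http_requests_total` and the sample payload are assumptions made for the example; the payload is hand-written to show the shape and is not real output from an AdvantEDGE deployment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_vector(payload):
    """Turn an instant-query JSON response into (labels, float value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    results = payload["data"]["result"]
    # Each sample is [unix_timestamp, "value as string"].
    return [(item["metric"], float(item["value"][1])) for item in results]


def run_query(base_url, promql):
    """Send the query; requires a reachable Prometheus server."""
    with urlopen(build_query_url(base_url, promql), timeout=10) as response:
        return parse_instant_vector(json.load(response))


# Hypothetical server address and metric name.
url = build_query_url(
    "http://prometheus.example:9090", "rate(http_requests_total[5m])"
)

# Hand-written payload showing the response shape, not real data.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"job": "demo"}, "value": [1700000000, "0.5"]}],
    },
}
print(url)
print(parse_instant_vector(sample))
```

Sample values arrive as strings in the JSON response, so they are converted to floats before use.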

Prometheus server also monitors its configured alert thresholds, informing the Alert Manager of any alert conditions.

Alert Manager

Alert Manager processes alerts received from Prometheus server. When an alert is received, the Alert Manager sends an alert notification to its configured listeners via e-mail, chat or notification systems.

Alert Manager also supports alert silencing and aggregation.

Thanos

Thanos is an open-source, highly available Prometheus setup with long-term storage capabilities. It runs as a sidecar in Prometheus pods and pushes data to its provisioned object store for long-term data retention.

Thanos components used in AdvantEDGE include:

Sidecar: Implements the common gRPC StoreAPI & uploads metrics to object store
Store Gateway: Implements the StoreAPI for historical data in an object storage bucket
Query: Implements the Prometheus HTTP API by gathering data from underlying StoreAPIs in a Thanos cluster
Compactor: Compacts data blocks & performs data downsampling

Thanos must be configured & deployed with an object store where long-term data is stored.

NOTE: Thanos is disabled by default in the AdvantEDGE deployment configuration.