
The Adventures of the Energy Metrics Data Collector – Going Green with Networking Technology

Kurt Semba, Principal Architect, Office of the CTO – Published 30 Jun 2022

In April, I wrote a blog to celebrate Earth Day titled, “Who’s Going to Save the Polar Bears? Environmentalists, Politicians, or Engineers?” This blog was the first in a series of “green” blogs about the enterprise networking industry and how we can be more intelligent with energy consumption via technology innovation.

As I slide into the second blog of this continuing series, I hope to give you a “behind the scenes” view of what the Office of the CTO at Extreme Networks is doing to research possible innovation in this arena. We decided to take an engineering, data-driven approach to discover possible ways to conserve energy for our customers. Our engineers have built a metrics collector and have begun collecting data.

In preparation for this project, we talked to interested customers who use a mix of cloud and on-premises network management systems. Our goal is to retrieve quality metrics from an extensive spectrum of customer scenarios while enabling ultimate flexibility for data collection and deployment options. So, we built a dedicated, on-premises data collection engine to run onsite for a few weeks at some of our most supportive customer locations. We are capturing live data from production networks rather than from our engineering labs. And what is the end game for this data-driven project? Lessons learned from the collected data can then be funneled into our mainstream product portfolio to help reduce energy consumption. More specifically, the goal is to derive multiple benefits from this project:

  • Provide empirical data to our core engineering team to help them enhance our products regarding energy consumption and efficiency.
  • Potentially add energy consumption-related dashboards to our existing management and monitoring products.
  • Create recommendations to our customers in the form of whitepapers, knowledgebase articles, blogs, etc.
  • Spark new research ideas and innovative thinking.

Allow me to outline our overall approach for the data collection. We delivered the collector as a VMware OVA to customers. This method solves a few challenges. First, the data can be collected and stored on-premises using a single virtual machine (VM), which enables us to collect data from non-cloud-enabled switches and Wi-Fi controllers. Second, all the required solution components are pre-installed on a single VM, with no external dependencies. Third, it makes installation at customer locations easy: most customers already have a VMware infrastructure and the technical expertise to operate a VM.

As depicted in Figure 1, using either SSH or SNMP, the collector regularly queries data from switches running either the Switch Engine (EXOS) or Fabric Engine (VOSS) operating system. To extract data from the two supported Wi-Fi platforms (ExtremeCloud IQ Controller and WiNG controller), the collector uses the Wi-Fi solutions’ representational state transfer (REST) APIs. The data is stored locally in a time-series database (TSDB). We can export each database after the data collection phase ends and securely move the data to our research lab. The data from all participating customers will be stored in a central DB and used to run multiple analytical models using both machine learning techniques and manual discovery methods. While validating our initial assumptions, we anticipate gaining many new valuable insights.


Figure 1 – The metrics collector
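To make the collection flow more concrete, here is a minimal Go sketch of how such a pluggable, interval-based collection loop could be structured. All type and function names are illustrative assumptions, not the actual collector code:

```go
// Illustrative sketch only – not the actual collector implementation.
package collector

import (
	"context"
	"log"
	"time"
)

// Metric is one data point destined for the time-series database.
type Metric struct {
	Time   time.Time
	Device string
	Name   string
	Value  float64
}

// Collector is implemented once per source type, e.g. EXOS/VOSS switches via
// SSH or SNMP, and ExtremeCloud IQ Controller or WiNG via their REST APIs.
type Collector interface {
	Name() string
	Collect(ctx context.Context) ([]Metric, error)
}

// run polls every collector on a fixed interval and hands the results to a
// store function that writes them to the TSDB.
func run(ctx context.Context, interval time.Duration, collectors []Collector, store func([]Metric) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, c := range collectors {
				go func(c Collector) { // one slow device must not block the others
					metrics, err := c.Collect(ctx)
					if err != nil {
						log.Printf("%s: %v", c.Name(), err)
						return
					}
					if err := store(metrics); err != nil {
						log.Printf("store: %v", err)
					}
				}(c)
			}
		}
	}
}
```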

The collector VM also provides a few dashboards locally so customers can inspect some of the data during the collection process. This onsite visibility provides immediate value during the early phases of this project.

So what data are we collecting and why? The following is a broad overview of some of the wired and wireless metrics related to power consumption that we are gathering from the real-world production environments of customers.

Wired (switches): data collected – why we are collecting

  • Interface speed – What influence does this have on power efficiency?
  • Interface traffic – Does more traffic equal more energy use? How much more?
  • Link topology via Link-Layer Discovery Protocol (LLDP) – Do different network topologies impact energy consumption?
  • Power supply unit (PSU) #, PSU type, PSU consumption – How much energy is consumed per switch and PSU type?
  • Power over Ethernet (PoE) port classes & per port consumption – How much of the overall energy is used by PoE consumers?
  • Switch models – How much more efficient are newer models (newer chips) versus older models?
  • CPU & memory utilization percentage – Is there a correlation with energy consumption?
  • Fan speed – Do different speeds correlate with energy consumption?
  • Physical locations: name, country, city, time zones – Data can be used for filtering & aggregation views in dashboards

Wireless: data collected – why we are collecting

  • Access points (APs): serial, type, MAC, IP, name, SW, site, ETH power status – Base data
  • LLDP data – APs don’t report their energy consumption, so we need to combine the per-port PoE consumption of a switch with the LLDP data to know which switch port an AP connects to and, ultimately, how much power it consumes.
  • Aggregate client traffic per radio: bytes, packets, OS Type – Do different traffic flows correlate with energy consumption?
  • Aggregate radio metrics: Wi-Fi frequency channel, channel size, channel utilization, signal-to-noise ratio (SNR), received signal strength (RSS), percentage of clients – How do these multiple radio metrics align with energy consumption?
  • Physical locations: name, country, city, time zones – Data can be used for filtering & aggregation views in dashboards

So, you might ask me for further details of the inner workings of this data-collection project. In this blog, I will primarily focus on the technical stack of the collector. As we progress, I will discuss the backend technology in future blog posts. Here is a quick breakdown of the collector components:

The Collector

The collector is written in Go (Golang) because the language has a small footprint and provides a reliable concurrency mechanism called goroutines. Let’s say we want to collect metrics from a switch every 5 minutes. For EXOS switches, we decided to use secure shell (SSH) as the communication protocol and run debug commands to grab the metrics. This process can sometimes take longer than a minute to complete. Now apply that to a customer network with 100 switches, and you understand why concurrency is mandatory: if we collected data synchronously, a 5-minute collection interval would already be used up after only five or six switches. Goroutines allow us to open multiple SSH sessions simultaneously and thus increase data collection efficiency.
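As a rough illustration of that pattern, here is a minimal sketch of concurrent SSH collection using goroutines and the golang.org/x/crypto/ssh package. The hostnames, credentials, and CLI command are placeholders and do not reflect the real collector:

```go
// Illustrative sketch only – placeholder hosts, credentials, and commands.
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/crypto/ssh"
)

// collectSwitch opens an SSH session to one switch and runs a single command.
func collectSwitch(host string, cfg *ssh.ClientConfig) (string, error) {
	client, err := ssh.Dial("tcp", host+":22", cfg)
	if err != nil {
		return "", err
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		return "", err
	}
	defer session.Close()

	// Placeholder command; the real collector runs platform-specific debug commands.
	out, err := session.CombinedOutput("show power")
	return string(out), err
}

func main() {
	cfg := &ssh.ClientConfig{
		User:            "admin",
		Auth:            []ssh.AuthMethod{ssh.Password("secret")},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // acceptable for a sketch, not for production
		Timeout:         30 * time.Second,
	}

	switches := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"} // ... up to hundreds of switches

	var wg sync.WaitGroup
	for _, host := range switches {
		wg.Add(1)
		go func(h string) { // one goroutine (and SSH session) per switch
			defer wg.Done()
			out, err := collectSwitch(h, cfg)
			if err != nil {
				fmt.Printf("%s: %v\n", h, err)
				return
			}
			fmt.Printf("%s: collected %d bytes of output\n", h, len(out))
		}(host)
	}
	wg.Wait() // the whole fleet is polled well within one collection interval
}
```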

The Database

We are collecting metrics over time (for example, every 5 minutes), which results in thousands of metrics per minute on a typical customer network. Therefore, we require a DB that efficiently handles storing and querying time-series data. So, we ended up going with TimescaleDB for the following reasons:

  • TimescaleDB provides a high write rate using a mechanism called hypertables, which partitions a normal table into multiple chunks that can be written to in parallel.
  • TimescaleDB provides a continuous aggregates feature, which automatically rolls up a raw metrics table into time buckets.
  • To give an example of how powerful this is: let’s say we want to build a dashboard chart that shows the average energy use over time of all power supplies from all switches. If 100 switches collect this metric every 5 minutes, the raw table will have 1200 entries for every hour, and the dashboard should display this for the past month. To draw this chart, a Structured Query Language (SQL) query would have to process 1200 metrics * 24 hours * 30 days = 864,000 values. Using the continuous aggregate feature from TimescaleDB provides a materialized view that automatically condenses those 1200 values per hour into a single hourly average (see the SQL sketch after this list). Dashboard queries use the materialized view instead of the raw metrics table, so the dashboard only needs to process 1 metric * 24 hours * 30 days = 720 values. The dashboard loads much more quickly and the user is happy with its performance.
  • TimescaleDB builds on standard PostgreSQL and thus works with SQL and most standard tools available.
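To make the hypertable and continuous aggregate ideas concrete, here is a minimal SQL sketch. The table and column names (psu_metrics, watts, and so on) are illustrative assumptions, not our actual schema:

```sql
-- Illustrative schema only; not the collector's real table layout.
CREATE TABLE psu_metrics (
    time      TIMESTAMPTZ      NOT NULL,
    switch_id TEXT             NOT NULL,
    psu_slot  INT              NOT NULL,
    watts     DOUBLE PRECISION
);

-- Turn the plain table into a hypertable so writes are spread across time chunks.
SELECT create_hypertable('psu_metrics', 'time');

-- Continuous aggregate: one automatically maintained average per hour,
-- matching the 720-values-per-month example above.
CREATE MATERIALIZED VIEW psu_power_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS bucket,
    avg(watts)                  AS avg_watts
FROM psu_metrics
GROUP BY bucket;
```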

The Dashboard

As we wanted to provide our supportive customers with immediate value out of this project, we decided to add Grafana to the OVA. Grafana is a multi-platform, open-source analytics and interactive visualization web application. It has built-in support for TimescaleDB and allows time-series data to be visualized in very powerful and customizable ways. As seen in Figure 2, the left panel displays time-series data and the right panel shows the same data in an aggregated fashion. The right panel uses a standard table visualization with an override for the kWh column that switches it to the LCD gauge cell display mode with custom, percentage-based coloring thresholds.

Figure 2 – Grafana dashboard showing overall energy consumption per country
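
For a flavor of what a panel query can look like, here is a simplified example that reuses the illustrative psu_power_hourly view from the database section; it is not the exact query behind Figure 2:

```sql
-- Simplified Grafana panel query; $__timeFilter() is a Grafana macro that
-- restricts the result to the dashboard's selected time range.
SELECT
    bucket    AS "time",
    avg_watts AS "Average PSU power (W)"
FROM psu_power_hourly
WHERE $__timeFilter(bucket)
ORDER BY bucket;
```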

The Log Aggregator

In addition to displaying metrics, Grafana can also be used to work with application logs in a more powerful and visual way if Loki is part of the tech stack. Loki is a log aggregation system designed to store and query logs from all your applications and infrastructure. Our main application is the collector, so we configured its Docker container to use the Loki Docker logging driver to forward logs to the Loki service. Figure 3 shows a simple example of how we can query all “ERROR” logs filtered to the “collector” Docker container within the last 7 days. Grafana charts the occurrence frequency of those ERROR logs over time and displays the detailed log lines.
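Such a query is written in LogQL, Loki’s query language. A minimal sketch, assuming the container label attached by the Loki Docker driver is named container_name (the exact label set depends on the driver configuration), looks roughly like this, with the 7-day window selected in Grafana’s time picker:

```
{container_name="collector"} |= "ERROR"
```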


Figure 3 – Loki, log file aggregation

We can even have Grafana display the “context” of each ERROR by showing the log lines that occurred immediately before and after it – very impressive, and I highly recommend it, especially once an application grows more complex and consists of more than a few containers!

The Orchestration

Every component of our tech stack runs as a Docker container, which keeps the pieces isolated, portable, and easy to upgrade. We are using Docker Compose to build, start, and stop the application, the Docker network, and all Docker volumes. It still amazes me how easy it is to upgrade the application after it has been deployed at a customer site: simply run docker compose pull to fetch the latest images and then docker compose up -d to restart only those containers that have a newer image. That’s it.
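To illustrate, here is a heavily trimmed docker-compose.yml along these lines. Image names, ports, and volumes are placeholders rather than our actual configuration (the Loki logging driver additionally requires the grafana/loki-docker-driver plugin on the host):

```yaml
# Illustrative only – not the collector's real compose file.
services:
  collector:
    image: example/energy-collector:latest   # placeholder image name
    depends_on: [timescaledb]
    logging:
      driver: loki
      options:
        loki-url: "http://localhost:3100/loki/api/v1/push"
  timescaledb:
    image: timescale/timescaledb:latest-pg14
    volumes:
      - tsdb-data:/var/lib/postgresql/data
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

volumes:
  tsdb-data:
```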

We have begun our data collection effort. Do you want to participate? If you are an Extreme customer and would like to participate, you can find more information on the project’s landing page.

Extreme Green Initiative

You can help us understand energy use, lower your energy use, and help save the polar bears!

I’m looking forward to my next blog in this series, where we will start looking at the collected data and see which insights we can derive. We invite you to come along on this journey over the next few months to learn from our results. Stay tuned.

