Network monitoring evolution: Home Assistant metrics, alert tuning, and LAN latency

I already wrote about standing up WAN and LAN performance testing with Prometheus and Grafana. That post covered the basics: why I moved off a frequently scraped speedtest exporter, wiring Home Assistant into the path, and using iperf3 with the node_exporter textfile collector on Proxmox LXCs.

Running it in production taught me different lessons than the first deploy. Metric names from Home Assistant did not match what my dashboards expected. Alerts fired on single blips. I assumed iperf3 gave me “LAN latency” because the CLI prints numbers, but TCP mode does not measure round-trip time the way I needed. Alongside that work, the usual Prometheus gotchas showed up: SNMP walks that take longer than your scrape interval, and wireless or controller metrics where the exporter is the weak link, not the airtime. This post is the second lap: normalization, alert discipline, and measuring latency with the right tool.

Problem: pretty graphs, noisy pages

The WAN story was always going to be a trade. A dedicated speedtest exporter is simple until you hit provider rate limits or burn bandwidth on a schedule your ISP does not love. Folding WAN tests into Home Assistant fixed the operational issue (one scheduled test, one place to tune frequency), but Prometheus still saw long metric names and labels that did not line up with the recording rules and dashboard variables I had written for the old exporter.

On the alert side, “WAN slower than yesterday” is easy to write and hard to live with. A single slow test looks like an incident. The same pattern showed up on LAN throughput and on infrastructure that is not actually down but fails scrapes often enough to look dead.

Finally, I had a panel that mixed throughput and “latency” from iperf3. Throughput was trustworthy. The latency column was not doing what I thought, which matters when you are trying to separate “WiFi is busy” from “routing is wrong.”

Investigation: what each tool actually measures

Home Assistant exposes its Speedtest integration metrics to Prometheus if you connect the two. The values are fine for trend lines; the shape of the metric names is what breaks dashboards built for another label scheme. The fix is not magic, just mechanical: Prometheus recording rules that alias or aggregate to stable names your Grafana queries can rely on. I stopped editing every panel when upstream naming shifted and centralized the mapping in one place.

For comparing “now” versus “recent baseline,” the PromQL offset modifier is the blunt instrument that works. It does not account for seasonality (hour-of-day effects), but for “is this test an outlier versus ten minutes ago” it is enough to drive a less twitchy alert expression than a raw threshold on a single sample.
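To make the baseline comparison concrete, here is a minimal Python sketch of what a ratio such as `metric / (metric offset 10m)` computes. It is only an emulation for reasoning about the idea: the sample data and function names are hypothetical, and it ignores Prometheus details such as the lookback window that limits how old a sample may be.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (unix timestamp in seconds, value); the list must be sorted by timestamp.
Sample = Tuple[float, float]


def value_at(samples: List[Sample], t: float) -> Optional[float]:
    """Return the most recent value at or before time t, or None if none exists."""
    idx = bisect_right(samples, (t, float("inf")))
    return samples[idx - 1][1] if idx else None


def ratio_to_baseline(samples: List[Sample], now: float,
                      offset_seconds: float) -> Optional[float]:
    """Current value divided by the value offset_seconds earlier."""
    current = value_at(samples, now)
    baseline = value_at(samples, now - offset_seconds)
    if current is None or baseline is None or baseline == 0:
        return None  # no usable baseline, so there is nothing to compare against
    return current / baseline
```

A ratio well below 1 means the current test is slower than the baseline. Because the baseline moves with the data, a sustained slowdown stops looking like an outlier once the offset window catches up with it, which is one reason to pair this with a plain threshold.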

iperf3 in TCP mode reports retransmits and goodput. It does not give you ICMP-style round-trip time. I had been eyeballing fields in the output that looked latency-adjacent without checking the semantics. Once I did, the fix was obvious: measure RTT with ping (or a small UDP probe if you prefer) from the same host pair on the same schedule, and publish those numbers through the same textfile path as the throughput metrics. Systemd timers already drove the iperf runs; adding a ping phase was a small change compared to debugging “slow network” stories with the wrong signal.
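As a rough illustration of the ping phase, the sketch below parses the summary lines that ping prints and renders them in the text format the node_exporter textfile collector reads. The metric names (`lan_ping_*`), the `target` label, and the function names are my own placeholders, not what the original setup uses, and the regular expressions assume the summary format printed by Linux iputils ping.

```python
import os
import re
import tempfile
from typing import Dict, Optional

# Summary lines as printed by Linux iputils ping; other builds may differ.
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
_RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_ping(output: str) -> Dict[str, Optional[float]]:
    """Extract packet loss (0 to 1) and RTT statistics in seconds."""
    stats: Dict[str, Optional[float]] = {
        "loss_ratio": None,
        "rtt_min_seconds": None,
        "rtt_avg_seconds": None,
        "rtt_max_seconds": None,
    }
    loss = _LOSS_RE.search(output)
    if loss:
        stats["loss_ratio"] = float(loss.group(1)) / 100.0
    rtt = _RTT_RE.search(output)
    if rtt:
        # ping reports milliseconds; Prometheus convention is base units.
        mn, avg, mx, _mdev = (float(g) / 1000.0 for g in rtt.groups())
        stats.update(rtt_min_seconds=mn, rtt_avg_seconds=avg, rtt_max_seconds=mx)
    return stats


def render_metrics(target: str, stats: Dict[str, Optional[float]]) -> str:
    """Render one target's stats; write one file per target to avoid repeated TYPE lines."""
    lines = []
    for name, value in stats.items():
        if value is None:
            continue  # skip values that could not be parsed
        metric = f"lan_ping_{name}"
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f'{metric}{{target="{target}"}} {value}')
    return "\n".join(lines) + "\n"


def write_atomically(path: str, text: str) -> None:
    """Write to a temp file, then rename, so a scrape never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp, 0o644)  # mkstemp defaults to 0600; the exporter must be able to read it
    os.replace(tmp, path)
```

A timer-driven script would run ping with `subprocess.run`, pass its stdout to `parse_ping`, and write the rendered text to a `.prom` file in the textfile directory. The temporary file deliberately does not end in `.prom`, so the collector ignores it until the rename.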

Flaky scrapes were a separate class from the WAN and LAN tests themselves. In most homelabs you eventually point Prometheus at SNMP on a NAS or a switch, or at a UniFi-style exporter (unpoller is the common choice). Those jobs time out or skip for reasons that are not always “your network is on fire.” Short for: durations on alert rules and zero tolerance for missed scrapes meant Alertmanager and I were not on speaking terms. Longer durations, grouping related signals, and excluding or downgrading known-noisy checks returned alerts to the “I should open a ticket” tier instead of the “mute the channel” tier. None of that is specific to how those scrape definitions are stored (Ansible, Compose, or hand-edited YAML); the behavior is the same once Prometheus is scraping them.

Solution: normalize, then alert, then simplify the board

Recording rules for Home Assistant WAN metrics. I defined rules that present consistent series names (download, upload, ping) regardless of how the integration labels its series this month. Grafana variables and repeat panels target those stable names. When the integration updates, I update the rules once instead of chasing panel JSON.
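One way to keep that mapping in a single place is to generate the rule file from a small table. The sketch below is an assumption-heavy illustration: the stable names (`wan:speedtest_*`) and the Home Assistant selectors are placeholders I made up, so check what your own Home Assistant Prometheus endpoint actually exports before copying any of them.

```python
from typing import Dict

# Hypothetical mapping: stable recorded name -> selector for whatever
# Home Assistant exports today. Neither side is taken from a real setup.
WAN_SERIES: Dict[str, str] = {
    "wan:speedtest_download_mbps":
        'homeassistant_sensor_data_rate_mbit_per_s{entity="sensor.speedtest_download"}',
    "wan:speedtest_upload_mbps":
        'homeassistant_sensor_data_rate_mbit_per_s{entity="sensor.speedtest_upload"}',
    "wan:speedtest_ping_ms":
        'homeassistant_sensor_duration_ms{entity="sensor.speedtest_ping"}',
}


def render_rule_group(name: str, mapping: Dict[str, str]) -> str:
    """Render a Prometheus rule file with one recording rule per mapping entry."""
    lines = ["groups:", f"  - name: {name}", "    rules:"]
    for record, expr in mapping.items():
        # YAML single-quoted scalars escape a quote by doubling it.
        quoted = expr.replace("'", "''")
        lines.append(f"      - record: {record}")
        lines.append(f"        expr: '{quoted}'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_rule_group("wan_normalize", WAN_SERIES), end="")
```

When the integration renames a series, only the right-hand side of the table changes. Running `promtool check rules` on the generated file before reloading Prometheus is a cheap way to catch a typo in a selector's syntax.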

Alerts that require consecutive failures. WAN degradation alerts now ask for multiple bad evaluations in a row before firing. Combined with slightly wider thresholds based on observed variance, that removed the overnight pages from single slow tests. I kept a stricter path for “test did not run at all,” which is a different failure mode than “ISP had a bad minute.”
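The effect of requiring consecutive failures can be sketched in a few lines of Python. This is a simulation for reasoning about the behavior, not how Prometheus implements it: Prometheus expresses the same idea as a for: duration on the rule rather than a count, and the threshold and sample values here are invented.

```python
from typing import Iterable, List


def alert_states(values: Iterable[float], threshold: float,
                 required: int) -> List[bool]:
    """Per evaluation, report whether the alert fires.

    The alert fires only once `required` consecutive values have been
    below `threshold`; any healthy value resets the streak.
    """
    if required < 1:
        raise ValueError("required must be at least 1")
    streak = 0
    states: List[bool] = []
    for value in values:
        streak = streak + 1 if value < threshold else 0
        states.append(streak >= required)
    return states


if __name__ == "__main__":
    # One isolated slow test, then three slow tests in a row, then recovery.
    download_mbps = [900, 300, 910, 320, 310, 305, 900]
    print(alert_states(download_mbps, threshold=600, required=3))
```

With these numbers the isolated 300 Mbps result never fires, and the alert turns on only at the third slow result in the run. One caveat when translating this to a for: duration: a gauge from a scheduled test keeps its last value between runs, so the duration has to span several test runs, not just several rule evaluations.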

LAN latency from ping, throughput from iperf3. The textfile job exports both. Dashboards show throughput and RTT side by side with the same timestamp semantics, which makes VLAN and WiFi issues easier to reason about than a blended panel that mixed meanings.

Scrape hygiene on slow or chatty jobs. For targets that legitimately time out, I aligned scrape intervals and alert for: durations with how long I am willing to wait before caring, and I stopped treating a single missed scrape as a Sev-1. Where an alert duplicated something I already get from a higher-level check, I removed the duplicate.
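A small sanity check along these lines can be sketched as follows. The job names and numbers are hypothetical, and the second check is my own rule of thumb, not documented Prometheus behavior: if the for: duration on a "target down" alert is no longer than the missed scrapes you are willing to tolerate, one bad scrape can be enough to fire it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScrapeJob:
    name: str
    interval_s: int   # scrape_interval for the job
    timeout_s: int    # scrape_timeout for the job
    alert_for_s: int  # for: duration on the job's "target down" alert


def review(job: ScrapeJob, tolerated_misses: int = 1) -> List[str]:
    """Return human-readable problems found in a job's timing settings."""
    problems: List[str] = []
    if job.timeout_s > job.interval_s:
        # Prometheus refuses a config where the timeout exceeds the interval.
        problems.append(
            f"{job.name}: scrape_timeout {job.timeout_s}s exceeds "
            f"scrape_interval {job.interval_s}s"
        )
    if job.alert_for_s <= tolerated_misses * job.interval_s:
        # Heuristic: the alert should outlast the misses we agreed to ignore.
        problems.append(
            f"{job.name}: for: {job.alert_for_s}s may fire after "
            f"{tolerated_misses} missed scrape(s)"
        )
    return problems


if __name__ == "__main__":
    for job in (ScrapeJob("snmp_nas", 120, 90, 600),
                ScrapeJob("unpoller", 30, 45, 30)):
        print(job.name, review(job) or "ok")
```

For a slow SNMP walk, the first check is the one that tends to bite: the walk needs a timeout long enough to finish, which in turn forces a scrape interval at least that long.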

Dashboard cleanup. I deleted panels that duplicated information, fixed queries that relied on deprecated label matchers, and tightened regex filters so variables only list interfaces I actually graph. Less clutter means faster triage.

Reflection

The theme is the same as every other monitoring rollout I have written about: first make signals trustworthy, then make alerts boring, then polish the UI. Home Assistant as the WAN test runner is an operational win, but it pushed complexity into Prometheus naming. Recording rules paid that down. iperf3 still has a clear job (throughput between chosen endpoints). Latency needed an explicit instrument, not an optimistic read of TCP stats.

I verified the loop the same way I always do: induce a bad test, confirm the alert when it should fire, confirm silence when the link recovers, and read a week of graphs for false negatives. The consolidated network dashboard is easier to scan than the first version, and my notification volume matches how seriously I actually treat homelab degradation.

If you already run the stack from the earlier post, treat this as the tuning pass: stable metric names, alert expressions that match human patience, and latency measured with a tool that measures latency. Everything else is commentary.

Victor Da Luz

Related reading

Setting Up Network Performance Testing Infrastructure

Prometheus and Grafana: Why Your Homelab Needs Monitoring

Researching Node Exporter on macOS workstations for homelab monitoring
