Troubleshooting QNAP SNMP monitoring timeouts

Storage in my homelab sits on a QNAP TS-h973AX running QuTS Hero. It backs Proxmox Backup Server over NFS, holds Time Machine targets over SMB, and shows up in Grafana via the SNMP exporter. For a while, that last part was the noisy one: Prometheus kept treating the NAS as “down” when the real problem was that SNMP walks were slower than my timeouts.

This is what I changed, what I got wrong once, and what happened when I peeled the next layer (NFS and Time Machine) after SNMP stopped crying wolf.

The problem

The snmp-nas* scrape jobs against the SNMP exporter were timing out. Prometheus would mark targets failed, and NASServiceDown-style alerts fired even though the NAS was fine.

The SNMP agent on this box is not fast. Walking the QuTS Hero OID tree takes long enough that my original module timeouts in the exporter were shorter than reality, and I had not consistently kept Prometheus scrape timeouts above the module timeouts. The qnap module was set to about 20s while scrapes were taking on the order of 60–90s. The qnaplong module's timeout was likewise shorter than the heavy walk it performs. Either way, scrapes could not finish before something gave up.

Investigation

I split the work mentally into two layers:

SNMP exporter — how long a single module waits for the agent and how many retries it allows.
Prometheus — how long a scrape is allowed to run versus how often you scrape.

The rule I should have applied from the start: scrape timeout must be greater than the exporter module timeout, and both need to reflect how slow the device actually is, not how fast I wish it were.

I also tried trimming the qnaplong module to speed things up by dropping “system” OIDs. That was a mistake: Grafana lost CPU, memory, uptime, and model info. I put those metrics back.

Solution: SNMP exporter

In the homelab repo, the SNMP exporter modules for QNAP live under iac/services/prometheus-grafana/snmp-exporter/fragments/modules/qnap.yml (paths vary slightly if you reorganize; the idea is the same).

Roughly what I moved to:

qnap module: timeout increased from 20s to 100s, retries from 1 to 2.
qnaplong module: timeout increased from 30s to 280s (retries were already at 2).
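As a rough sketch, the two modules might look like this after the change. Only the timeout and retries values come from the list above; the top-level modules key assumes a recent snmp_exporter config layout, and the walk and metrics definitions are omitted rather than guessed.

```yaml
# Sketch of qnap.yml after the change. Only timeout and retries are taken
# from the article; the rest of the real module definitions is left out.
modules:
  qnap:
    # walk / metrics definitions unchanged and omitted here
    timeout: 100s   # was 20s
    retries: 2      # was 1
  qnaplong:
    # walk / metrics definitions unchanged and omitted here (system OIDs kept)
    timeout: 280s   # was 30s
    retries: 2      # unchanged
```

The upstream generator documentation describes timeout as applying to each individual SNMP request, so it is worth confirming against your exporter version how these values relate to the total walk time.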

That matches the idea of a “fast” job versus a periodic deep walk without pretending the agent is snappy.

Solution: Prometheus

In iac/services/prometheus-grafana/prometheus/prometheus.yml, the scrape jobs needed to line up with those module timeouts:

snmp-nas: scrape timeout increased to 120s, equal to the 120s scrape interval in my setup (Prometheus rejects a scrape timeout larger than the interval).
snmp-nas-long: scrape timeout set to 5m, matching a 5m scrape interval.

If your intervals differ, keep the same discipline: scrape interval ≥ scrape timeout, and scrape timeout > the SNMP module timeout you need for the slowest walk.
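A minimal sketch of how the two jobs could be laid out, using the common SNMP exporter relabeling pattern. The intervals, timeouts, job names, and module names come from this article; the NAS hostname and exporter address are hypothetical placeholders, and the real prometheus.yml may be organized differently.

```yaml
scrape_configs:
  - job_name: snmp-nas
    scrape_interval: 120s
    scrape_timeout: 120s            # above the 100s qnap module timeout
    metrics_path: /snmp
    params:
      module: [qnap]
    static_configs:
      - targets: ["nas.example.lan"]          # hypothetical NAS hostname
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target          # pass the NAS as ?target=
      - source_labels: [__param_target]
        target_label: instance                # keep the NAS name as instance
      - target_label: __address__
        replacement: "snmp-exporter:9116"     # hypothetical exporter address

  - job_name: snmp-nas-long
    scrape_interval: 5m
    scrape_timeout: 5m              # above the 280s qnaplong module timeout
    metrics_path: /snmp
    params:
      module: [qnaplong]
    static_configs:
      - targets: ["nas.example.lan"]          # hypothetical NAS hostname
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: "snmp-exporter:9116"     # hypothetical exporter address
```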

Verification

After deployment, snmp-nas-long settled into a healthy pattern: scrape duration around 132s, target up, well inside the 280s module budget and the 5m scrape timeout.

snmp-nas was better but still riding close to the edge; the agent really does burn most of a 100s window on some walks. The split between a lighter job and a long job is still worth it, but neither side gets “normal” Linux node_exporter timings.
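One way to keep an eye on a job that rides close to its limit is a warning on Prometheus's built-in scrape_duration_seconds metric. This is a sketch, not a rule from the homelab repo: the alert name, the 80% threshold, and the 15m hold time are assumptions chosen for illustration.

```yaml
groups:
  - name: snmp-scrape-budget        # hypothetical group name
    rules:
      - alert: SNMPScrapeNearTimeout
        # 120 is the snmp-nas scrape timeout in seconds; 0.8 is an assumed margin
        expr: scrape_duration_seconds{job="snmp-nas"} > 0.8 * 120
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "SNMP scrape of {{ $labels.instance }} is using over 80% of its 120s timeout"
```

A warning like this fires before the target flips to down, which leaves time to raise the budget or trim the walk.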

Update: NFS disconnections

After SNMP was under control, a separate issue surfaced: NFS from the NAS to Proxmox would glitch hard enough to break PBS backups and make the cluster unhappy. Hard mounts plus minimal timeouts are a bad combo when the server blips: clients can hang instead of failing fast, which matches the “unresponsive GUI” class of Proxmox pain others have written up.

I moved the PBS-related mounts toward options that fail and retry instead of wedging:

rw,vers=3,_netdev,soft,timeo=30,retrans=3,noresvport

soft — operations can fail with an error instead of blocking forever when the server is gone.
timeo=30 — 3s RPC timeout (value is in tenths of a second for NFS).
retrans=3 — retry count before giving up on that request.
noresvport — helps with reconnect behavior after network interruptions (common recommendation in Proxmox + remote NFS threads).
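Since the change was rolled out with Ansible, a task along these lines would apply the options above. It is a sketch using the ansible.posix.mount module; the mount path and export source are hypothetical, and the actual PBS role may structure this differently.

```yaml
- name: Mount NAS export for PBS with options that fail instead of hanging
  ansible.posix.mount:
    path: /mnt/pbs-datastore                  # hypothetical mount point
    src: "nas.example.lan:/share/pbs"         # hypothetical NFS export
    fstype: nfs
    opts: rw,vers=3,_netdev,soft,timeo=30,retrans=3,noresvport
    state: mounted                            # writes fstab and mounts now
```

One caveat worth weighing: the nfs(5) man page warns that soft mounts can cause silent data corruption in certain cases, so this trades some data-integrity safety for client responsiveness.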

Ansible changes lived in the PBS role and group vars; Prometheus picked up broader NFS mount alerts so both nodes would page when a mount looked wrong, not only the PBS-specific paths I had before.
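The actual alert rules are not shown here, but a broad NFS mount alert could be sketched on top of node_exporter's filesystem collector. This assumes node_exporter runs on both Proxmox nodes and exposes NFS mounts; the alert name, hold time, and severity are illustrative.

```yaml
groups:
  - name: nfs-mounts                # hypothetical group name
    rules:
      - alert: NFSMountError
        # node_filesystem_device_error is 1 when node_exporter could not
        # stat the mount, which is what a dead soft mount tends to look like
        expr: node_filesystem_device_error{fstype=~"nfs4?"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "NFS mount {{ $labels.mountpoint }} on {{ $labels.instance }} is returning errors"
```

Matching on fstype rather than a specific mountpoint is what makes the rule cover every NFS mount on every node, not only the PBS paths.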

Update: Time Machine and SMB3

Time Machine to the NAS started failing with unhelpful “unknown error” style messages. On the NAS, the minimum SMB dialect was still SMB 2.0.2. QNAP even warns that older minimums can interfere with Time Machine tasks. macOS 26 (Tahoe) wants SMB3 for Time Machine over SMB.

I raised the minimum to SMB3 in Control Panel → Network & File Services → Win/Mac/NFS → Microsoft Networking → Advanced. Backups started completing again. Trade-off: anything that only speaks SMB2 needs a different target or a deliberate exception; for my house, everything modern enough to run Tahoe is fine on SMB3.

What I learned

QNAP’s SNMP agent is slow. Budget timeouts like you mean it, then add margin. Treat exporter module time and Prometheus scrape time as one system.

Do not strip “system” metrics from a long walk to “optimize” unless you are sure the dashboard does not need them. I had to revert that change.

Storage monitoring and storage mounts are related but not the same. Fixing SNMP did not fix NFS behavior; it just stopped monitoring from lying while I chased the next failure mode.

SMB minimum version is a real compatibility knob. When the OS moves (Tahoe), recheck the NAS side even if “nothing changed” on the Mac.

Disclosure: This article contains affiliate links. If you purchase through these links, I may earn a commission at no extra cost to you.

Victor Da Luz

