My monitoring stack was 59% of my DNS traffic, so I cached it

I had a Pi-hole pinned at 100 percent CPU. Giving its container more cores fixed the symptom, and that is its own story. While I was in there, the query log told me something I did not expect: a single host, my monitoring stack, was responsible for 59 percent of all the DNS queries Pi-hole was answering. This is the part where I went after the cause.

Where the queries were coming from

The monitoring host runs Prometheus and a stack of exporters. Prometheus scrapes its targets on a schedule, and to scrape a target by hostname it first has to resolve that hostname. Mine had 31 active targets across 21 unique hostnames, scraped every 15 seconds. With no caching in the middle, every scrape triggered fresh DNS lookups, and two per name: an A lookup for the IPv4 address and an AAAA lookup for IPv6. Twenty-one hostnames, doubled for A and AAAA, every 15 seconds, forever. All told, the host was sending about 370 queries a minute, and that was 59 percent of Pi-hole’s entire load.
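For a rough sense of scale, here is a back-of-envelope sketch in Python. It assumes each scrape cycle costs one A and one AAAA lookup per name and nothing else; the function name and the choice to count per hostname versus per target are mine, for illustration. The measured figure of about 370 a minute is higher than either estimate, so treat these as a floor under those assumptions, not a reconstruction of the log.

```python
# Back-of-envelope DNS query rate for a scraper with no cache in front of it.
# Assumption: every scrape cycle resolves each name once per record type.

def queries_per_minute(names_resolved: int,
                       scrape_interval_s: int = 15,
                       record_types: int = 2) -> float:
    """Lookups per minute if every cycle resolves every name afresh."""
    cycles_per_minute = 60 / scrape_interval_s
    return names_resolved * record_types * cycles_per_minute


per_hostname = queries_per_minute(21)  # one lookup pair per unique hostname
per_target = queries_per_minute(31)    # one lookup pair per scrape target

print(f"per unique hostname: {per_hostname:.0f}/min")
print(f"per scrape target:   {per_target:.0f}/min")
```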

So the real fix sits upstream of Pi-hole: stop sending it the same questions hundreds of times a minute.

A small cache in front of the firehose

The answer is a caching resolver on the monitoring host itself, so repeated lookups are served locally and only genuine misses ever reach Pi-hole. I used dnsmasq for it, a tiny, boring, extremely good caching DNS server.

Pointing the Docker containers at it took one wrinkle. A container’s resolv.conf points at 127.0.0.11, Docker’s own embedded resolver, and you do not edit that directly. Instead you tell Compose what the embedded resolver should forward to:

services:
  prometheus:
    dns:
      - 172.18.0.1 # the docker bridge gateway, where dnsmasq listens

Now Prometheus’s lookups go to Docker’s resolver, which forwards to dnsmasq on the host, which answers from cache and only forwards a real miss upstream to Pi-hole.
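One way to confirm a container is still going through Docker's embedded resolver is to look at the nameserver lines in its resolv.conf. The sketch below is a small Python helper for that. The function names and the sample file contents are hypothetical, and a real container's resolv.conf may carry extra options and comments.

```python
from typing import List

DOCKER_EMBEDDED_RESOLVER = "127.0.0.11"


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop anything after a '#', then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_embedded_resolver(resolv_conf_text: str) -> bool:
    """True when the only nameserver is Docker's embedded resolver."""
    return nameservers(resolv_conf_text) == [DOCKER_EMBEDDED_RESOLVER]


# Hypothetical contents, for illustration only.
sample = """\
# made-up example of a container's resolv.conf
nameserver 127.0.0.11
options ndots:0
"""

print(nameservers(sample), uses_embedded_resolver(sample))
```

Inside the container, the same functions can be fed the text of /etc/resolv.conf. The address given under dns: in Compose is not expected to appear there, since it configures where the embedded resolver forwards to.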

Two settings that were stopping it from caching anything

Standing the cache up was easy. Getting it to actually cache took two corrections, and both were me misunderstanding dnsmasq.

First, I reached for local-ttl to set a minimum cache time, and it did nothing. The logs showed every query still being forwarded. The reason is that local-ttl only sets the TTL on answers dnsmasq serves from its own local data, such as /etc/hosts and DHCP leases. It has no effect on responses that came from an upstream server, which is all of mine. The knob I actually wanted was min-cache-ttl, which forces a minimum lifetime on cached upstream responses regardless of the short TTLs they arrive with. I set that, and the A queries started caching.

Second, the AAAA queries, half of all the lookups, still forwarded every single time. The clue was in the log: the AAAA responses came back as NODATA-IPv6, meaning the hostname exists but has no IPv6 address. That is a perfectly valid answer and worth caching, but my config had no-negcache set, which tells dnsmasq to never cache negative answers. So every AAAA lookup for an IPv4-only host went all the way to Pi-hole, every time. Removing no-negcache and adding neg-ttl=1800, which gives negative answers a 30-minute lifetime when the upstream reply carries no SOA TTL of its own, fixed the other half of the problem.
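To see why each correction only fixed half of the traffic, here is a toy cache model in Python. It is a sketch of the idea, not of dnsmasq's internals. It assumes 21 IPv4-only hostnames looked up every 15 seconds, uses the 30-minute lifetimes from the config in this post, and treats a lifetime of 0 as "never cached". The function name is hypothetical.

```python
def forwarded_queries(hostnames: int,
                      scrape_interval_s: int,
                      positive_ttl_s: int,
                      negative_ttl_s: int,
                      minutes: int = 60) -> int:
    """Count lookups that miss a toy cache and get forwarded upstream.

    Every hostname is IPv4-only: the A lookup gets a positive answer and
    the AAAA lookup gets NODATA. A TTL of 0 means that kind of answer is
    never cached (negative_ttl_s=0 models no-negcache).
    """
    expires = {}  # (host index, record type) -> time the entry expires
    forwarded = 0
    for now in range(0, minutes * 60, scrape_interval_s):
        for host in range(hostnames):
            for rtype, ttl in (("A", positive_ttl_s), ("AAAA", negative_ttl_s)):
                key = (host, rtype)
                if expires.get(key, -1) > now:
                    continue  # still cached, answered locally
                forwarded += 1
                if ttl > 0:
                    expires[key] = now + ttl
    return forwarded


for label, pos, neg in (("no caching", 0, 0),
                        ("positive answers only", 1800, 0),
                        ("positive and negative", 1800, 1800)):
    total = forwarded_queries(21, 15, pos, neg)
    print(f"{label}: {total / 60:.1f} forwarded per minute")
```

In this model, caching only positive answers cuts the forwarded traffic roughly in half, because every AAAA lookup still misses. Caching the negative answers as well removes almost all of what is left.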

The config that finally worked:

cache-size=10000
min-cache-ttl=1800   # force a 30-minute floor on cached upstream answers
max-cache-ttl=3600   # cap at one hour
neg-ttl=1800         # cache NODATA / negative answers for 30 minutes too
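The two TTL bounds in that config act as a clamp on how long an upstream answer stays cached. A minimal Python sketch of that idea follows. It is a model of the described behavior, not dnsmasq's actual code, and the upstream TTL values in the loop are made up for illustration.

```python
def cached_lifetime(upstream_ttl: int,
                    min_cache_ttl: int = 0,
                    max_cache_ttl: int = 0) -> int:
    """Seconds an upstream answer stays cached in this toy model.

    A bound of 0 means that option is not set.
    """
    ttl = upstream_ttl
    if min_cache_ttl:
        ttl = max(ttl, min_cache_ttl)  # floor: stretch short TTLs
    if max_cache_ttl:
        ttl = min(ttl, max_cache_ttl)  # ceiling: trim long TTLs
    return ttl


SCRAPE_INTERVAL_S = 15

for upstream_ttl in (2, 300, 86400):  # hypothetical upstream TTLs
    unbounded = cached_lifetime(upstream_ttl)
    bounded = cached_lifetime(upstream_ttl, min_cache_ttl=1800, max_cache_ttl=3600)
    survives = bounded >= SCRAPE_INTERVAL_S
    print(f"upstream={upstream_ttl}s unbounded={unbounded}s "
          f"bounded={bounded}s outlives one scrape interval: {survives}")
```

The case that matters here is the first one: an answer whose TTL is shorter than the 15-second scrape interval expires before the next scrape unless a floor is applied.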

The result

The next time I checked the cache, it was returning 792 hits against 5 misses over a two-minute window, a 99.4 percent hit rate. Query volume from the monitoring host dropped from about 370 a minute to 87-130, a 65 to 76 percent cut. Together with giving the Pi-hole container more cores, that took it from pinned at 100 percent to sitting around 25-30 percent with capacity to spare.
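The percentages above follow directly from the raw counts. A short Python check, using only the numbers quoted in this section (the helper names are mine):

```python
def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups answered from cache."""
    return hits / (hits + misses)


def reduction(before: float, after: float) -> float:
    """Fractional drop from before to after."""
    return (before - after) / before


print(f"hit rate: {hit_rate(792, 5):.1%}")  # 99.4%
print(f"query cut: {reduction(370, 130):.0%} to {reduction(370, 87):.0%}")  # 65% to 76%
```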

Lessons

Look at who is actually querying before you scale the thing being queried. One host was 59 percent of my DNS load, and it was my own monitoring. The cheapest query is the one you never send.
local-ttl only covers names dnsmasq serves from local data such as /etc/hosts, so it is the wrong knob for caching upstream answers. To force caching of upstream responses with short TTLs, use min-cache-ttl.
Cache your negatives. NODATA-IPv6 is a valid answer. With no-negcache set, every AAAA lookup for an IPv4-only host forwards forever, and AAAA is half of your lookups.
Cache at the noisy edge. Putting the resolver on the host doing the querying kept those queries off the network and off Pi-hole entirely, instead of just helping the central server survive them.

Victor Da Luz
