Rebalancing my Proxmox cluster and the Pi-hole outage I caused

By Victor Da Luz
proxmox homelab infrastructure high-availability lxc pihole

With both nodes in my Proxmox cluster upgraded to 32GB RAM, I had a distribution problem I’d been ignoring: node01 was carrying ~18.8GB of allocated memory across 12 containers while node02 had about 9.6GB. A failover from node01 would dump a lot of load onto a node already running its own services. With equal RAM across both nodes, there was no longer any reason to leave it that way.

What followed was mostly straightforward LXC migration - and one self-inflicted DNS outage.

The plan and the constraints

I mapped out what couldn’t move before touching anything:

  • Home Assistant and NUT: both need USB passthrough. Stuck in place.
  • Immich and Frigate: highest-usage containers in the cluster, already on different nodes. Leave them.
  • RomM: had a long library scan running. Not a good time to migrate.
  • PBS (Proxmox Backup Server): has a bind mount to NAS storage. A bind mount that isn't marked as shared blocks migration.
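The hardware constraints above can usually be spotted in a container's config before planning a move. Here is a rough sketch, assuming it runs on a Proxmox node and that passthrough was configured through mpN bind mounts, devN entries, or raw lxc.* lines. placement_blockers is a hypothetical helper name, not a Proxmox command.

```shell
# Read a container config on stdin and print every line that usually pins
# the container to its current node. (Hypothetical helper, illustration only.)
placement_blockers() {
  awk '
    # A mount point whose source starts with "/" is a host path, i.e. a bind mount.
    /^mp[0-9]+: \// { print "bind mount: " $0 }
    # Device passthrough, either via devN entries or raw LXC options.
    /^dev[0-9]+:/ || /^lxc\.(mount\.entry|cgroup2?\.devices\.allow)/ {
      print "device passthrough: " $0
    }
  '
}

# Example, run on the node that owns the container:
#   placement_blockers < "/etc/pve/lxc/$vmid.conf"
```

No output means nothing in the config obviously ties the container to its node. It says nothing about service-level constraints.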

That left lighter non-HA containers as the obvious candidates. mealie (2GB), audiobookshelf (2GB), and homebox (1GB) were all sitting on node01 with no constraints. Moving them to node02 would drop node01’s allocation by 5GB.
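Allocation totals like these can be recomputed from the container configs rather than added up by hand. This is a minimal sketch, assuming the memory: value in pct config output is in MiB. sum_alloc_mib is a made-up helper name.

```shell
# Sum every "memory:" value (MiB) found in the `pct config` output on stdin.
sum_alloc_mib() {
  awk '$1 == "memory:" { total += $2 } END { print total + 0 }'
}

# Example, run on each node to get its allocated total (stopped containers included):
#   for id in $(pct list | awk 'NR > 1 { print $1 }'); do
#     pct config "$id"
#   done | sum_alloc_mib
```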

I also wanted to balance the HA container count. node01 had 6 HA-managed containers vs 4 on node02. Moving the Pi-hole container pihole01 to node02 would split that 5-5.

The migrations

Proxmox can't live-migrate LXC containers, so each one has to be stopped before it moves:

pct stop 117 && pct migrate 117 node02
pct stop 120 && pct migrate 120 node02
pct stop 122 && pct migrate 122 node02

Each one took a minute or two and started cleanly on node02. I then moved pihole01, the HA-managed Pi-hole on node01, to node02 as planned.
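An alternative to the explicit stop is restart migration, where pct migrate shuts the container down, moves it, and starts it again on the target. This is a sketch of that variant, not what I ran. migrate_all is a hypothetical wrapper, and it assumes it runs on the node that currently owns the containers.

```shell
# Restart-migrate each listed container to the target node, stopping at the
# first failure. Usage: migrate_all <target-node> <vmid>...
migrate_all() {
  target=$1
  shift
  for vmid in "$@"; do
    echo "migrating CT $vmid to $target"
    pct migrate "$vmid" "$target" --restart || {
      echo "CT $vmid failed to migrate" >&2
      return 1
    }
  done
}

# Example:
#   migrate_all node02 117 120 122
```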

After all the moves:

  • node01: 5 HA containers, 2 non-HA, ~11.7GB allocated
  • node02: 5 HA containers, 4 non-HA, ~15.9GB allocated

That looked good. Then I checked Uptime Kuma.

The outage

Multiple monitors were red. DNS was the common thread.

Both Pi-hole containers were on node02, and both were unreachable:

  • ct:110 (pihole02) - node02, unreachable
  • ct:116 (pihole01) - node02, unreachable

With both DNS resolvers down simultaneously, everything depending on DNS reported offline.

The problem was obvious in hindsight. DNS redundancy in a two-node cluster means one Pi-hole per node. If both land on the same node and that node has a problem - or even just a transient connectivity issue during migration - you lose all DNS at once. I’d been focused on balancing the HA container count and hadn’t thought through what that meant for Pi-hole specifically.
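A placement rule like this is easy to check mechanically. The sketch below relies on the cluster filesystem keeping each container's config under /etc/pve/nodes/<node>/lxc/<vmid>.conf, so the directory tells you which node owns it. node_of_ct, on_different_nodes, and the PVE_ROOT override are hypothetical names for illustration.

```shell
# Root of the Proxmox cluster filesystem; overridable for testing.
PVE_ROOT="${PVE_ROOT:-/etc/pve}"

# Print the name of the node that currently owns container $1.
node_of_ct() {
  for conf in "$PVE_ROOT"/nodes/*/lxc/"$1".conf; do
    [ -e "$conf" ] || continue
    basename "$(dirname "$(dirname "$conf")")"
    return 0
  done
  return 1
}

# Succeed only if containers $1 and $2 are on different nodes.
# Returns 2 if either container cannot be found.
on_different_nodes() {
  a=$(node_of_ct "$1") || return 2
  b=$(node_of_ct "$2") || return 2
  [ "$a" != "$b" ]
}

# Example with the two Pi-hole containers:
#   on_different_nodes 110 116 || echo "both Pi-holes share a node"
```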

I moved pihole01 back to node01. The container came up but its network interface was in a DOWN state - no IP, no connectivity.

The network interface fix

I could bring the interface up manually:

ip link set eth0 up

That got pihole01 an IP via DHCP and restored connectivity. But it wouldn’t come up automatically on the next container start.

The issue was in /etc/network/interfaces inside the container:

auto eth0

The auto eth0 line tells the networking init scripts to bring up eth0 at boot, but without an iface stanza they don't know how to configure it. With no DHCP instruction, the interface stays DOWN with no address.

Adding the missing line fixed it:

auto eth0
iface eth0 inet dhcp

After restarting the container, the interface came up automatically with a DHCP-assigned address. pihole01 was reachable and DNS was restored.
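The same gap can be detected ahead of time by looking for interfaces that are marked auto but never get an iface stanza. This is a minimal sketch for a Debian-style /etc/network/interfaces. missing_iface_stanzas is a hypothetical helper, and it does not follow source lines into interfaces.d.

```shell
# Print each interface that is marked "auto" or "allow-hotplug" but has no
# matching "iface" stanza. Takes the file path as $1, defaulting to the
# usual location inside the container.
missing_iface_stanzas() {
  awk '
    $1 == "auto" || $1 == "allow-hotplug" {
      for (i = 2; i <= NF; i++) wanted[$i] = 1
    }
    $1 == "iface" { defined[$2] = 1 }
    END {
      for (name in wanted) if (!(name in defined)) print name
    }
  ' "${1:-/etc/network/interfaces}"
}

# Example, run inside the container:
#   missing_iface_stanzas
```

Against the broken file above this would print eth0. After adding the iface line it prints nothing.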

The final distribution

After moving pihole01 back and fixing its network config:

  • node01 (6 HA, 2 non-HA): ~12.1GB allocated
  • node02 (4 HA, 4 non-HA): ~15.9GB allocated

The HA count ended up 6-4 instead of 5-5 because pihole01 went back to node01. That's fine. HA balance isn't about an equal split; it's about making sure neither node is catastrophically overloaded during a failover. 12.1GB vs 15.9GB is a much healthier spread than 18.8GB vs 9.6GB.

Pi-hole redundancy was back where it should be: pihole01 on node01, pihole02 on node02.

Lessons

Map placement constraints before any migration. Hardware constraints (USB, bind mounts) are obvious - service-level constraints are easier to miss. Any active-active redundant service needs to stay split across nodes. I checked the hardware constraints and missed the service constraint.

HA container count isn’t the only balance metric. I was optimizing for a 5-5 HA container split and caused a DNS outage doing it. The right goal is no catastrophic concentration of services on one node - which isn’t the same as an equal count.

Verify network config after migrating a container. The missing iface eth0 inet dhcp line may have been a pre-existing gap or something that got lost in migration. Either way, connectivity checks after every migration would have caught this before I called it done.
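A connectivity check of that kind could be as small as the following sketch. It parses the brief output of ip -br addr and succeeds only if the interface is UP with an IPv4 address. iface_has_ipv4 is a hypothetical helper name.

```shell
# Read `ip -br addr` output on stdin. Succeed if interface $1 is UP and has
# at least one IPv4 address.
iface_has_ipv4() {
  awk -v want="$1" '
    # Inside a container the name looks like "eth0@if12"; keep the part before "@".
    { split($1, name, "@") }
    name[1] == want && $2 == "UP" {
      for (i = 3; i <= NF; i++)
        if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\//) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Example, run on the node that now hosts the container:
#   pct exec 116 -- ip -br addr show dev eth0 | iface_has_ipv4 eth0 \
#     || echo "CT 116: eth0 has no IPv4 address"
```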

Check dependent services immediately after any migration. I found the outage in Uptime Kuma, but only after moving on to the next task. Testing DNS resolution right after moving a Pi-hole container would have caught the issue in seconds.
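Something like the sketch below would cover that DNS test. It assumes dig is installed on the machine running it. The resolver addresses in the example are placeholders from the documentation range, not my real Pi-hole IPs, and both function names are hypothetical.

```shell
# Succeed if the resolver at $1 returns at least one answer for $2
# (default: example.com) within a couple of seconds.
resolver_answers() {
  answer=$(dig +short +time=2 +tries=1 "@$1" "${2:-example.com}" 2>/dev/null) || return 1
  [ -n "$answer" ]
}

# Check every resolver given as an argument. Returns non-zero if any failed.
check_resolvers() {
  failed=0
  for ip in "$@"; do
    if resolver_answers "$ip"; then
      echo "ok   $ip"
    else
      echo "FAIL $ip"
      failed=1
    fi
  done
  return "$failed"
}

# Example (placeholder addresses):
#   check_resolvers 192.0.2.10 192.0.2.11
```

Running it immediately after moving either Pi-hole container checks each resolver on its own, so a single dead resolver shows up even while the other one is still answering.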


This was the follow-up to upgrading both nodes to 32GB RAM. The cluster is running more comfortably now, and I’ve got a better mental model of what needs explicit placement planning vs what can move freely.
