Rebalancing my Proxmox cluster and the Pi-hole outage I caused

With both nodes in my Proxmox cluster upgraded to 32GB RAM, I had a distribution problem I’d been ignoring: node01 was carrying ~18.8GB of allocated memory across 12 containers while node02 had about 9.6GB. A failover from node01 would dump a lot of load onto a node already running its own services. With equal RAM across both nodes, there was no longer any reason to leave it that way.

What followed was mostly straightforward LXC migration - and one self-inflicted DNS outage.

The plan and the constraints

I mapped out what couldn’t move before touching anything:

Home Assistant and NUT: both need USB passthrough. Stuck in place.
Immich and Frigate: highest-usage containers in the cluster, already on different nodes. Leave them.
RomM: had a long library scan running. Not a good time to migrate.
PBS (Proxmox Backup Server): has a bind mount to NAS storage. Bind mount points block migration unless they are explicitly flagged as shared.
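A check like the one above can be partly automated. The following is a minimal sketch, not the procedure I actually ran: it scans an LXC config read from stdin and prints anything that looks like a bind mount or device passthrough. The function name migration_blockers is hypothetical, and the patterns are assumptions about how these constraints typically appear in /etc/pve/lxc/<vmid>.conf.

```shell
# Print one line per likely migration blocker found in an LXC config on stdin.
# Assumption: bind mounts appear as mpN entries whose volume is an absolute path,
# and passthrough appears as lxc.mount.entry or devN lines.
migration_blockers() {
  awk '
    # mpN entries whose volume starts with "/" are bind (or device) mounts
    /^mp[0-9]+: *\// { print "bind-mount: " $0 }
    # raw LXC mount entries and devN entries usually indicate passthrough
    /^(lxc\.mount\.entry|dev[0-9]+):/ { print "passthrough: " $0 }
  '
}

# Usage (on the node that currently hosts the container):
#   migration_blockers < /etc/pve/lxc/117.conf
```

Empty output only means none of these patterns matched; it does not prove the container is safe to move.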

That left lighter non-HA containers as the obvious candidates. mealie (2GB), audiobookshelf (2GB), and homebox (1GB) were all sitting on node01 with no constraints. Moving them to node02 would drop node01’s allocation by 5GB.
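For reference, a per-node allocation figure can be derived by summing each container's configured memory limit. This is a rough sketch under the assumption that "allocated" means the memory: value reported by pct config (in MiB); the function name sum_memory_mb is hypothetical.

```shell
# Sum the "memory:" values (MiB) from one or more pct config outputs on stdin.
sum_memory_mb() {
  awk '/^memory:/ { total += $2 } END { print total + 0 }'
}

# Usage on a node (pct list only shows containers on the local node):
#   for id in $(pct list | awk 'NR > 1 { print $1 }'); do
#     pct config "$id"
#   done | sum_memory_mb
```

For the three containers above, that works out to 2048 + 2048 + 1024 = 5120 MiB, the 5GB that would leave node01.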

I also wanted to balance the HA container count. node01 had 6 HA-managed containers vs 4 on node02. Moving the Pi-hole container pihole01 to node02 would split that 5-5.

The migrations

LXC containers can’t be live-migrated, so the non-HA containers need to be stopped before migrating:

pct stop 117 && pct migrate 117 node02
pct stop 120 && pct migrate 120 node02
pct stop 122 && pct migrate 122 node02
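An alternative to the explicit stop is pct migrate's restart mode, which shuts the container down, moves it, and starts it on the target. The sketch below is a swapped-in variant, not what I ran: it loops over a list of container IDs and prints the commands by default. The function name migrate_all and the APPLY variable are hypothetical.

```shell
# Migrate each listed container to the target node in restart mode.
# Prints the commands by default; set APPLY=1 to actually run them.
migrate_all() {
  target=$1
  shift
  for vmid in "$@"; do
    if [ "${APPLY:-0}" = 1 ]; then
      # Stop at the first failure rather than continuing with a broken state
      pct migrate "$vmid" "$target" --restart || return 1
    else
      echo "pct migrate $vmid $target --restart"
    fi
  done
}

# Usage:
#   migrate_all node02 117 120 122            # dry run
#   APPLY=1 migrate_all node02 117 120 122    # perform the migrations
```

The dry run is the default so the plan can be reviewed before anything moves.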

Each migration took a minute or two, and each container started cleanly on node02. I then moved pihole01 (the HA-managed Pi-hole on node01) to node02 to balance the HA container count.

After all the moves:

node01: 5 HA containers, 2 non-HA, ~11.7GB allocated
node02: 5 HA containers, 4 non-HA, ~15.9GB allocated

That looked good. Then I checked Uptime Kuma.

The outage

Multiple monitors were red. DNS was the common thread.

Both Pi-hole containers were on node02, and both were unreachable:

ct:110 (pihole02) - node02, unreachable
ct:116 (pihole01) - node02, unreachable

With both DNS resolvers down simultaneously, everything depending on DNS reported offline.

The problem was obvious in hindsight. DNS redundancy in a two-node cluster means one Pi-hole per node. If both land on the same node and that node has a problem - or even just a transient connectivity issue during migration - you lose all DNS at once. I’d been focused on balancing the HA container count and hadn’t thought through what that meant for Pi-hole specifically.
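The rule "one Pi-hole per node" is easy to check mechanically. As a minimal sketch, assuming a list of "<name> <node>" pairs on stdin, the hypothetical function colocated below prints any node that hosts more than one container whose name starts with a given prefix.

```shell
# Read "<name> <node>" lines on stdin; print every node that hosts more than
# one container whose name starts with the given prefix.
colocated() {
  awk -v prefix="$1" '
    index($1, prefix) == 1 { count[$2]++ }
    END { for (n in count) if (count[n] > 1) print n }
  '
}

# Usage, assuming jq is installed on the node:
#   pvesh get /cluster/resources --type vm --output-format json \
#     | jq -r '.[] | "\(.name) \(.node)"' \
#     | colocated pihole
```

Any output means the redundant pair has collapsed onto a single node, which is exactly the state I created here.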

I moved pihole01 back to node01. The container came up but its network interface was in a DOWN state - no IP, no connectivity.

The network interface fix

I could bring the interface up manually:

ip link set eth0 up

That got pihole01 an IP via DHCP and restored connectivity. But the interface still wouldn’t come up automatically on the next container start.

The issue was in /etc/network/interfaces inside the container:

auto eth0

The auto eth0 line tells the networking init to bring up eth0 at boot, but without a matching iface stanza it doesn’t know how to configure the interface. There’s no DHCP instruction, so the interface stays DOWN with no address.

Adding the missing line fixed it:

auto eth0
iface eth0 inet dhcp

After restarting the container, the interface came up automatically with a DHCP-assigned address. pihole01 was reachable and DNS was restored.
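This kind of gap can be detected before a restart exposes it. The following sketch, with the hypothetical function name missing_iface, reads an interfaces file from stdin and prints every interface that has an auto line but no iface stanza.

```shell
# Print interfaces that are marked "auto" but have no matching "iface" stanza.
missing_iface() {
  awk '
    $1 == "auto"  { for (i = 2; i <= NF; i++) auto[$i] = 1 }
    $1 == "iface" { iface[$2] = 1 }
    END { for (n in auto) if (!(n in iface)) print n }
  '
}

# Usage from the Proxmox host:
#   pct exec 116 -- cat /etc/network/interfaces | missing_iface
```

Against the broken file it prints eth0; against the fixed file it prints nothing.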

The final distribution

After moving pihole01 back and fixing its network config:

node01 (6 HA, 2 non-HA): ~12.1GB allocated
node02 (4 HA, 4 non-HA): ~15.9GB allocated

The HA count ended up 6-4 instead of 5-5 because pihole01 went back to node01. That’s fine. HA balance isn’t about an equal split; it’s about making sure neither node is catastrophically overloaded during a failover. 12.1GB vs 15.9GB is a much healthier spread than 18.8GB vs 9.6GB.

Pi-hole redundancy was back where it should be: pihole01 on node01, pihole02 on node02.

Lessons

Map placement constraints before any migration. Hardware constraints (USB, bind mounts) are obvious - service-level constraints are easier to miss. Any active-active redundant service needs to stay split across nodes. I checked the hardware constraints and missed the service constraint.

HA container count isn’t the only balance metric. I was optimizing for a 5-5 HA container split and caused a DNS outage doing it. The right goal is no catastrophic concentration of services on one node - which isn’t the same as an equal count.

Verify network config after migrating a container. The missing iface eth0 inet dhcp line may have been a pre-existing gap or something that got lost in migration. Either way, connectivity checks after every migration would have caught this before I called it done.

Check dependent services immediately after any migration. I found the outage in Uptime Kuma, but only after moving on to the next task. Testing DNS resolution right after moving a Pi-hole container would have caught the issue in seconds.
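A post-migration DNS check can be as small as one query per resolver. This is a sketch under assumptions: it needs dig installed, the function name check_resolvers is hypothetical, and the addresses in the usage line are documentation placeholders rather than my real Pi-hole IPs.

```shell
# Query each resolver once with a short timeout; report ok/FAIL per address.
# Returns non-zero if any resolver failed to answer.
check_resolvers() {
  failed=0
  for ip in "$@"; do
    # Require both a successful exit status and a non-empty answer
    if answer=$(dig +short +time=2 +tries=1 "@$ip" example.com 2>/dev/null) \
        && [ -n "$answer" ]; then
      echo "ok   $ip"
    else
      echo "FAIL $ip"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (placeholder addresses):
#   check_resolvers 192.0.2.1 192.0.2.2
```

Running this right after each Pi-hole move would have flagged the outage before I started the next task.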

This was the follow-up to upgrading both nodes to 32GB RAM. The cluster is running more comfortably now, and I’ve got a better mental model of what needs explicit placement planning vs what can move freely.

Victor Da Luz

Related reading

Upgrading my Proxmox cluster to 32GB per node and testing HA failover

From Docker Swarm to Proxmox HA: A Homelab Migration Journey

Migrating Pi-hole from a Raspberry Pi to a Proxmox LXC
