Rebalancing my Proxmox cluster and the Pi-hole outage I caused
With both nodes in my Proxmox cluster upgraded to 32GB RAM, I had a distribution problem I’d been ignoring: node01 was carrying ~18.8GB of allocated memory across 12 containers while node02 had about 9.6GB. A failover from node01 would dump a lot of load onto a node already running its own services. With equal RAM across both nodes, there was no longer any reason to leave it that way.
What followed was mostly straightforward LXC migration - and one self-inflicted DNS outage.
The plan and the constraints
I mapped out what couldn’t move before touching anything:
- Home Assistant and NUT: both need USB passthrough. Stuck in place.
- Immich and Frigate: highest-usage containers in the cluster, already on different nodes. Leave them.
- RomM: had a long library scan running. Not a good time to migrate.
- PBS (Proxmox Backup Server): has a bind mount to NAS storage. Bind mounts block migration by default.
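The hardware checks in the list above can be roughly automated by scanning each container's config file. This is a minimal sketch, not the method used in the post: the function name `migration_blockers` is hypothetical, and the patterns are a heuristic that assumes bind mounts appear as `mpN:` entries with a host path and that device passthrough appears as raw `lxc.*` device lines or `devN:` entries.

```shell
# Print reasons a Proxmox LXC config might pin a container to its node.
# Usage: migration_blockers /etc/pve/lxc/<vmid>.conf
# Heuristic sketch only; the patterns are assumptions, not an exhaustive list.
migration_blockers() {
  conf=$1
  # Bind mount: an mpN entry whose source is a host path (starts with "/")
  # rather than a storage volume such as "local-lvm:vm-117-disk-1".
  grep -Eq '^mp[0-9]+:[[:space:]]*/' "$conf" && echo "bind-mount"
  # Device passthrough: raw LXC device rules, mount entries, or devN entries.
  grep -Eq '^(lxc\.cgroup2?\.devices\.allow|lxc\.mount\.entry|dev[0-9]+):' "$conf" \
    && echo "device-passthrough"
  return 0
}
```

Running it over every file in `/etc/pve/lxc/` on a node gives a quick list of containers that probably have to stay put. It says nothing about service-level constraints.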
That left lighter non-HA containers as the obvious candidates. mealie (2GB), audiobookshelf (2GB), and homebox (1GB) were all sitting on node01 with no constraints. Moving them to node02 would drop node01’s allocation by 5GB.
I also wanted to balance the HA container count. node01 had 6 HA-managed containers vs 4 on node02. Moving the Pi-hole container pihole01 to node02 would split that 5-5.
The migrations
LXC containers can’t be live-migrated, so I stopped each non-HA container before moving it:
pct stop 117 && pct migrate 117 node02
pct stop 120 && pct migrate 120 node02
pct stop 122 && pct migrate 122 node02
Each one took a minute or two and started cleanly on node02. I then moved pihole01 (ct:116, the HA-managed Pi-hole on node01) to node02 to even out the HA container count.
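The post doesn't show the command used to move the HA-managed container. As a hedged sketch, the difference between the two cases might look like the dry-run helper below. The name `move_cmd` is hypothetical, and the use of `ha-manager relocate` (which stops the service and restarts it on the target node) is an assumption about how an HA-managed container would be moved.

```shell
# Dry-run helper (hypothetical): print the command that would move a container.
# kind is "ha" for HA-managed containers, anything else for unmanaged ones.
move_cmd() {
  vmid=$1; target=$2; kind=$3
  if [ "$kind" = "ha" ]; then
    # Ask the HA stack to stop the service and restart it on the target node.
    echo "ha-manager relocate ct:$vmid $target"
  else
    # Same stop-then-migrate pattern used for the non-HA containers above.
    echo "pct stop $vmid && pct migrate $vmid $target"
  fi
}

move_cmd 116 node02 ha
move_cmd 117 node02 plain
```

Printing the commands first makes it easy to review the whole plan before anything is stopped.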
After all the moves:
- node01: 5 HA containers, 2 non-HA, ~11.7GB allocated
- node02: 5 HA containers, 4 non-HA, ~15.9GB allocated
That looked good. Then I checked Uptime Kuma.
The outage
Multiple monitors were red. DNS was the common thread.
Both Pi-hole containers were on node02, and both were unreachable:
- ct:110 (pihole02) - node02, unreachable
- ct:116 (pihole01) - node02, unreachable
With both DNS resolvers down simultaneously, everything depending on DNS reported offline.
The problem was obvious in hindsight. DNS redundancy in a two-node cluster means one Pi-hole per node. If both land on the same node and that node has a problem - or even just a transient connectivity issue during migration - you lose all DNS at once. I’d been focused on balancing the HA container count and hadn’t thought through what that meant for Pi-hole specifically.
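A placement check along these lines would have flagged the problem before the migration was called done. It is a sketch under assumptions: it relies on container configs living at `/etc/pve/nodes/<node>/lxc/<vmid>.conf`, and `node_of`, `split_across_nodes`, and the `PVE_ROOT` override are hypothetical names introduced for illustration.

```shell
# Print the node that currently owns a container, based on where its config
# file lives in the cluster filesystem (/etc/pve/nodes/<node>/lxc/<vmid>.conf).
# PVE_ROOT is a hypothetical override so the function can be tried off-cluster.
node_of() {
  for f in "${PVE_ROOT:-/etc/pve}"/nodes/*/lxc/"$1".conf; do
    [ -e "$f" ] || continue
    f=${f%/lxc/*}      # strip "/lxc/<vmid>.conf"
    echo "${f##*/}"    # keep only the node directory name
    return 0
  done
  return 1
}

# Succeed only if both containers were found and sit on different nodes.
split_across_nodes() {
  a=$(node_of "$1") && b=$(node_of "$2") && [ "$a" != "$b" ]
}

# ct:110 is pihole02 and ct:116 is pihole01.
split_across_nodes 110 116 || echo "WARNING: Pi-holes not confirmed on separate nodes"
```

The warning also fires when either container can't be found, which seems like the safer failure mode for a redundancy check.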
I moved pihole01 back to node01. The container came up but its network interface was in a DOWN state - no IP, no connectivity.
The network interface fix
I could bring the interface up manually:
ip link set eth0 up
That got pihole01 an IP via DHCP and restored connectivity. But it wouldn’t come up automatically on the next container start.
The issue was in /etc/network/interfaces inside the container:
auto eth0
The auto eth0 line tells the networking init to bring up eth0 at boot, but without a matching iface stanza it doesn’t know how to configure the interface. With no DHCP instruction, eth0 stays DOWN with no address.
Adding the missing line fixed it:
auto eth0
iface eth0 inet dhcp
After restarting the container, the interface came up automatically with a DHCP-assigned address. pihole01 was reachable and DNS was restored.
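The same gap can be detected mechanically. The sketch below lists any interface that is marked auto but has no iface stanza. The function name `auto_without_iface` is hypothetical, and the check only understands plain `auto` and `iface` lines, not `allow-hotplug` or sourced files.

```shell
# List interfaces that are marked "auto" but have no matching "iface" stanza.
# Usage (inside the container): auto_without_iface /etc/network/interfaces
auto_without_iface() {
  awk '
    $1 == "auto"  { for (i = 2; i <= NF; i++) wanted[$i] = 1 }
    $1 == "iface" { configured[$2] = 1 }
    END { for (name in wanted) if (!(name in configured)) print name }
  ' "$1"
}
```

Against the broken file this prints eth0. Against the fixed file it prints nothing.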
The final distribution
After moving pihole01 back and fixing its network config:
- node01: 6 HA containers, 2 non-HA, ~12.1GB allocated
- node02: 4 HA containers, 4 non-HA, ~15.9GB allocated
The HA count ended up 6-4 instead of 5-5 because pihole01 went back to node01. That’s fine. HA balance isn’t about an equal split; it’s about making sure neither node is catastrophically overloaded during a failover. 12.1GB vs 15.9GB is a much healthier spread than 18.8GB vs 9.6GB.
Pi-hole redundancy was back where it should be: pihole01 on node01, pihole02 on node02.
Lessons
Map placement constraints before any migration. Hardware constraints (USB, bind mounts) are obvious - service-level constraints are easier to miss. Any active-active redundant service needs to stay split across nodes. I checked the hardware constraints and missed the service constraint.
HA container count isn’t the only balance metric. I was optimizing for a 5-5 HA container split and caused a DNS outage doing it. The right goal is no catastrophic concentration of services on one node - which isn’t the same as an equal count.
Verify network config after migrating a container. The missing iface eth0 inet dhcp line may have been a pre-existing gap or something that got lost in migration. Either way, connectivity checks after every migration would have caught this before I called it done.
Check dependent services immediately after any migration. I found the outage in Uptime Kuma, but only after moving on to the next task. Testing DNS resolution right after moving a Pi-hole container would have caught the issue in seconds.
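A post-migration DNS check can be as small as the sketch below. It assumes `dig` is installed on the machine running the check. The function name `dns_ok` and the default test name example.com are illustrative choices, not something from the original setup.

```shell
# Return success if the given resolver answers a query for the test name.
# Usage: dns_ok <resolver-ip> [name]
dns_ok() {
  # dig exits non-zero when no server replies; treat that as a failure
  # instead of trusting whatever text it printed.
  answer=$(dig +short +time=2 +tries=1 "@$1" "${2:-example.com}" 2>/dev/null) || return 1
  # An empty answer (for example NXDOMAIN) also counts as a failure.
  [ -n "$answer" ]
}
```

Calling it once per Pi-hole right after a move, for example `dns_ok 192.0.2.10 || echo "pihole01 not answering"` with the placeholder address swapped for the real one, turns the Uptime Kuma discovery into a check that takes a couple of seconds.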
This was the follow-up to upgrading both nodes to 32GB RAM. The cluster is running more comfortably now, and I’ve got a better mental model of what needs explicit placement planning vs what can move freely.
Related reading
Upgrading my Proxmox cluster to 32GB per node and testing HA failover
How I upgraded both Lenovo M710q nodes from 4-8GB to 32GB RAM, what I got wrong with Alertmanager silencing, and what the first real HA failover test showed.
From Docker Swarm to Proxmox HA: A Homelab Migration Journey
The story of migrating from a Docker Swarm cluster to a Proxmox HA setup, the planning that made it work, and what I learned about infrastructure changes that matter.
Migrating Pi-hole from a Raspberry Pi to a Proxmox LXC
Replacing pi2.internal (Raspberry Pi 4) with pihole01, a Proxmox LXC container, as the new Pi-hole master. The migration itself was uneventful; the surprises were in TLS, Pi-hole v6 exporter auth, and Grafana label relabeling.