Upgrading my Proxmox cluster to 32GB per node and testing HA failover

My two-node Proxmox cluster had been running comfortably since I migrated from Docker Swarm, but both nodes were RAM-constrained. node01 had 8GB and node02 had 4GB - enough to keep services running, but no headroom. I had two pairs of 16GB DDR4 SODIMM modules ready and a free weekend. This was the upgrade I’d been putting off, and it turned out to be the first real test of the cluster’s HA configuration.

The pre-work audit

Before touching hardware, I did a quick audit of the HA setup.

The qdevice documentation was stale. My runbooks and architecture notes still referenced an old Raspberry Pi as the qdevice host - the actual witness had moved to a container on my QNAP NAS months earlier and I’d never updated the docs. Not a crisis, but I fixed it first. Running maintenance against a wrong mental model of your infrastructure is a bad habit.

I also found that the UniFi Controller container was missing from the HA resource list. My architecture decision record explicitly listed it as a service that should auto-failover, but it had never been configured. I added it to the HA manager before starting the hardware work. This mattered: if it hadn’t been HA-managed, evacuating node02 would have required manual intervention for it.

Maintenance mode and container evacuation

With the audit done, I silenced monitoring and put node02 into maintenance mode:

ssh root@node01.internal "ha-manager crm-command node-maintenance enable node02"

Maintenance mode tells the Proxmox HA manager to move containers off the node gracefully. The HA-managed containers - UniFi, Homepage, Glance - migrated automatically to node01. That part was smooth.

One container couldn’t migrate: the Proxmox Backup Server container. It has a bind mount to NAS storage, and bind mounts block live migration. I stopped it manually. Since PBS isn’t HA-managed and only runs scheduled backup jobs, leaving it stopped for a few hours was fine. The cluster stayed quorate throughout.

Alert silencing: specific matchers are a trap

I set up an Alertmanager silence for node02 before the shutdown. My initial matcher was:

instance="node02.internal:9100"

That only covers the node exporter on port 9100. The moment I powered down node02, alerts started firing from other sources. I scrambled to create additional silences after the fact - exactly backwards.

The right approach for per-host exporters is a regex matcher that catches everything running on the node:

instance=~"node02.internal.*"

That covers node_exporter, Alloy, and anything else that reports in with that hostname. One caveat: if you’re running a centralized Proxmox VE exporter (prometheus-pve-exporter), its alerts carry the pve-exporter container as the instance, not the node hostname. Those need a separate matcher like node="node02". I updated my maintenance runbook to use both patterns going forward.

The RAM installation

With node02 shut down, I opened the case and installed both 16GB modules. When I powered it back on and checked:

ssh root@node02.internal "free -h"

Sixteen gigabytes. Not thirty-two.

ssh root@node02.internal "dmidecode -t memory | grep -A 5 'Memory Device'"

Slot A was empty. I’d assumed both modules were seated, but only one had clicked into place. The Lenovo M710q has two SODIMM slots stacked close together, and it’s easy to feel confident about a module that isn’t fully seated. I powered node02 back down, reseated both modules properly, and got 32GB (reported as 31 GiB, which is correct for two 16GB physical modules) on the next boot.

HA failover test

After node02 reintegrated with maintenance mode disabled, I tested unplanned failover by cutting power to it:

ssh root@node02.internal "poweroff"

The cluster handled it exactly as designed. The qdevice witness - a lightweight service running on my QNAP NAS - kept the cluster quorate even with node02 gone. All HA-managed containers had been on node01 since the earlier evacuation, so they kept running without interruption. I verified the services were accessible from outside the cluster, and they were.

One behavior worth calling out: failback is disabled in my configuration. When node02 came back up, none of the containers moved back. They stayed on node01. This is intentional. Automatic failback means two disruption events - one during the failure, one during recovery. I’d rather control the timing of the second one manually.

node01 upgrade

With node02 fully validated, I ran the same procedure for node01: silence monitoring, maintenance mode, evacuate to node02, power down, install RAM, rejoin, verify. The modules seated correctly on the first attempt this time.

After both upgrades, the cluster had 64GB of RAM distributed across two nodes. Both nodes stable, HA validated, qdevice witness confirmed working through two separate node failures.

Reflection

The physical work was simple once I sorted out the RAM seating mistake - that’s a me problem, not hardware. The more interesting parts were the audit findings and the alert silencing lesson.

On silencing: the specific-instance pattern (node:9100) is technically correct but practically wrong. The moment something beyond the node exporter fires, you’re handling alerts while trying to handle maintenance. Regex matchers on the hostname are the right call any time you’re taking a whole node offline.

The HA test was satisfying in a way that the months of setup hadn’t been. I’d configured replication, set up the qdevice, listed containers in the HA manager - all without ever knowing if it would actually work. It did. The qdevice earned its place in the stack.

The pre-work audit also changed how I think about planned maintenance. I went in expecting to install RAM and came out having fixed stale documentation and a missing HA resource. Both were harmless in normal operation and would have mattered in an actual emergency. Running that kind of audit before maintenance windows is on the checklist now.

Lessons learned:

Use regex matchers for Alertmanager silences: instance=~"nodename.internal.*" covers all per-host exporters; centralized scrapers (like a pve-exporter) need an additional node="nodename" matcher
Create silences before shutdown - not after alerts start firing
SODIMM slots can feel seated when they’re not; verify with dmidecode before reassembling
HA-managed containers evacuate automatically in maintenance mode; containers with bind mounts to NAS storage don’t and need to be stopped manually
Failback disabled is the right default; two disruptions is worse than one
Run a docs and HA config audit before planned maintenance; stale entries don’t surface until you need them