The Proxmox Cluster Join Bug That Wasn't What It Seemed
I’ve wanted to try Proxmox for a long time. The idea of running VMs and containers on dedicated hardware instead of Docker Desktop or QEMU appealed to me, but I only had a Mac mini and Raspberry Pis in my homelab. Proxmox doesn’t play well with ARM, so I kept putting it off.
That changed when I found a couple of cheap Lenovo M710q mini PCs. These tiny x86 machines are perfect for Proxmox - they’re small, quiet, and have enough power for homelab workloads. I pulled the trigger and started setting up my first Proxmox cluster.
This is the first in what will be a series about my Proxmox setup, the services I’ll run there, and the lessons I learn along the way. Today’s story is about a mysterious bug that looked like a TLS problem but turned out to be something much simpler.
What Was Broken
I tried to join a fresh Proxmox VE 9 cluster and kept hitting a blocking error:
- TASK ERROR: 500 Server closed connection without sending any data back
- Variants showed up as TLS handshake errors from node02 to node01 on port 8006
The cluster join would start, then fail partway through. Sometimes it would get further, sometimes it would fail immediately. The error messages pointed to TLS problems, but that didn’t make sense for a fresh installation.
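Before blaming certificates, a raw handshake probe against the API port is a quick way to separate "TLS is misconfigured" from "the connection keeps dying". A minimal sketch, run from node02 (the grep pattern is just a convenience, adjust to taste):

# Probe pveproxy's TLS endpoint on node01 directly from node02
openssl s_client -connect node01:8006 </dev/null 2>&1 | grep -E 'Verify return code|errno'
# A self-signed / unknown-CA verify code is expected on a fresh install;
# handshakes that reset or hang part-way point at the network, not at TLS itself.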
What I Observed
The first clue came when I noticed packet loss between the two Lenovo nodes. No loss to the Pis, just between the two new machines. That seemed odd for hardware on the same switch, but these are refurbished machines, so I suspected something hardware-related.
Then I ran some tests:
- pvecm add failed with TLS errors
- curl -k https://node01:8006/api2/json/version sometimes stalled
- Both nodes reported the exact same MAC on the onboard NIC and the bridge
That last part caught my attention. Both boxes showed 6c:4b:90:16:36:73 for both enp0s31f6 and vmbr0: identical hardware, identical MAC addresses. In hindsight, I had noticed intermittent delays when accessing the nodes via SSH, but I dismissed them, wrongly assuming that the VSCode terminal was bugging out and just needed a restart.
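The check itself is trivial once you think to run it; something along these lines on each node, comparing the output between machines:

# Compact view of the onboard NIC and the bridge (run on both nodes)
ip -br link | grep -E 'enp0s31f6|vmbr0'
# On my hardware, both nodes printed 6c:4b:90:16:36:73 for both interfaces.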
What The Network Showed
Two hosts advertising the same L2 address cause MAC flapping in the switch's forwarding table and ARP confusion on their neighbors. That explained the intermittent packet loss and why TLS handshakes would reset mid-flight.
The switch couldn’t decide which port to send traffic to, so it would bounce between them. Sometimes packets would reach the right destination, sometimes they wouldn’t. TLS connections are sensitive to this kind of interruption - they need reliable delivery to complete the handshake.
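You can see both symptoms from the hosts themselves. A rough sketch, run from node02 with the interface names from above:

# Quantify the loss between the two new nodes
ping -c 50 node01 | tail -n 2
# Watch ARP on the bridge while a connection attempt stalls; bursts of
# who-has/reply for the same addresses suggest the switch keeps re-learning the MAC
tcpdump -eni vmbr0 arp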
What Finally Happened
I temporarily set a different MAC on node02’s bridge to test my theory:
ip link set dev vmbr0 address 02:50:00:00:00:02
The packet loss disappeared immediately. The cluster join succeeded on the next attempt.
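For reference, the sequence was roughly: confirm the override took, confirm the loss was gone, then retry the join from node02:

# Bridge should now report the locally administered address
ip link show vmbr0 | grep ether
# Loss test should come back clean
ping -c 50 node01 | tail -n 2
# Retry the join from node02 against the existing cluster node
pvecm add node01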
The Root Cause
Identical hardware can ship with duplicated NIC EEPROMs. Both Lenovo M710q units had the same burned-in MAC address. When I created the bridge interfaces, they inherited the same MAC as the physical NIC.
This created a classic MAC collision on the network. The switch would learn the MAC on one port, then see it advertised from another port, then flip back and forth. The result was intermittent connectivity that broke TLS handshakes.
The Fix
I proved the fix worked first, then made it persistent. Here’s what I did:
Step 1: Prove it quickly on node02 (temporary until reboot):
ip link set dev vmbr0 address 02:50:00:00:00:02
Step 2: Persist on node02 with ifupdown2:
# /etc/network/interfaces (node02, relevant stanzas)
iface enp0s31f6 inet manual
    # pin a locally administered MAC on the physical NIC
    hwaddress ether 02:50:00:00:00:02

iface vmbr0 inet static
    address 10.0.x.18/24
    gateway 10.0.x.1
    bridge-ports enp0s31f6
    bridge-stp off
    bridge-fd 0
    # keep the bridge on the same locally administered MAC
    bridge_hw 02:50:00:00:00:02
Apply and verify:
ifreload -a
ip link show vmbr0 | sed -n '1,3p'
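The second command should show the bridge carrying the new address; on my node02 the output looked roughly like this (index, flags, and MTU will differ):

3: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 02:50:00:00:00:02 brd ff:ff:ff:ff:ff:ff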
I left node01 with its burned-in MAC and only changed node02. That was enough to eliminate the collision.
Lessons Learned
This bug looked like a TLS problem at first. It wasn’t. The real lessons here:
- If cluster join fails with TLS resets on 8006, check for packet loss first. Layer 2 problems often masquerade as higher-layer issues.
- Identical hardware can ship with duplicated NIC EEPROMs. Always confirm unique MACs on physical NICs and on bridges (see the audit sketch after this list).
- Set a locally administered address (LAA) to escape collisions quickly. The 02:50:00:00:00:02 pattern I used is a valid LAA that won't conflict with burned-in addresses.
- Start at layer 2. When things don't work reliably, check the fundamentals first.
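A small audit sketch to close with: it assumes root SSH between nodes, the node names from this post, and that both machines use the same NIC name (they do here, being identical hardware). It just prints each node's onboard NIC MAC so you can eyeball duplicates; anything whose first octet has the locally-administered bit set (like 02) was assigned by hand rather than burned in.

#!/bin/bash
# Print the onboard NIC's MAC on each node; the addresses must all differ.
for host in node01 node02; do
    printf '%s ' "$host"
    ssh root@"$host" cat /sys/class/net/enp0s31f6/address
done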
I wonder how many small businesses with small IT departments live with this sort of flaky network availability for the same reason.
What’s Next
The cluster is running now, and I’m excited to start migrating services from Docker Swarm to Proxmox. The Lenovo M710q units are surprisingly capable - they handle multiple VMs without breaking a sweat.
I’ll be writing more about the Proxmox setup as I go. Next up will be the initial cluster configuration, storage setup, and migrating my first services. There’s something satisfying about running VMs on dedicated hardware instead of containers on a Mac.