Fixing Proxmox replication when ZFS has no common base snapshot

By Victor Da Luz
proxmox zfs homelab replication troubleshooting lxc

Proxmox VE replication between cluster nodes is built on ZFS snapshots. Each run expects the source and target to share a snapshot the job can use as a common parent. When that chain breaks, the scheduler does the right thing and refuses to guess. The error message is blunt: No common base snapshot on volume(s).

This post is what happened when a scheduled replication job for a secondary Pi-hole replica CT hit that wall, and how I put replication back on rails without rebuilding the container from scratch.

What broke

I run a two-node cluster with ZFS-backed container storage. One node is the primary for most CTs; replication pushes copies to the other on a fixed interval. For the Pi-hole replica (CT 110 in my lab), the job was set to run every fifteen minutes.

Replication started failing in the task log. The summary line was the same every time: no common base snapshot on the volume backing that CT. Mail and monitoring were fine; this was storage sync only. Still, I did not want a stale copy on the second node if I ever needed it.

What that error actually means

Proxmox replication is incremental. It creates snapshots on the source, sends the diff to the target, and both sides keep the metadata that ties those snapshots together. If someone deletes a snapshot that the job still considers “current,” or a send is interrupted badly enough, the source and target datasets no longer agree on where to resume.

At that point the replication driver has nothing safe to diff against. It stops with No common base snapshot rather than risk sending a patch that would corrupt the destination.
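Conceptually, each run boils down to an incremental ZFS send from the previous replication snapshot to a new one. The sketch below only prints such a pipeline, so the dependency on a shared base is easy to see. The dataset, snapshot, and host names are hypothetical, and this is not the exact command Proxmox runs internally.

```shell
# Print (do not run) an incremental send pipeline.
# Usage: incremental_send_pipeline <dataset> <base-snap> <new-snap> <target-host>
# All four arguments are placeholders chosen for illustration.
incremental_send_pipeline() {
    dataset=$1 base=$2 new=$3 host=$4
    # `zfs send -i` sends only the changes between @base and @new.
    # `zfs receive` on the target fails unless @base already exists there.
    printf 'zfs send -i %s@%s %s@%s | ssh %s zfs receive %s\n' \
        "$dataset" "$base" "$dataset" "$new" "$host" "$dataset"
}

incremental_send_pipeline tank/subvol-110-disk-0 base next node02
```

If the base snapshot is missing on either side, there is nothing valid to pass to -i, which is the situation the error describes.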

Common ways to get there:

  • Manual zfs destroy on a replication-related snapshot
  • Interrupted or partial sync that left names out of sync
  • A job that kept half-updating state across nodes after storage maintenance

The fix is not to “force” a blind full send from the UI in a panic. It is to make the snapshot timeline consistent again, then either re-seed or recreate the job so Proxmox can establish a fresh baseline.

What I checked first

On the node that owns the CT (the replication source), I listed replication jobs:

pvesr list

That showed the job id (in my case 110-1), the target node, and the schedule. I confirmed the job still pointed at the right peer and storage id.

On both nodes I inspected snapshots for the dataset backing CT 110:

zfs list -t snapshot | grep 'subvol-110-'

You will want to match the actual ZFS path for that CT’s disk on your pool; the exact string depends on how the volume was created. What I looked for was pairs of names that should have matched between nodes and any obvious orphan snapshots on one side only.

The mismatch was clear enough: the target side was missing snapshots the job still expected, or had a different head snapshot name than the source. Either way, there was no shared parent for the next incremental.
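Eyeballing two snapshot lists gets tedious, so here is a minimal sketch of the same comparison. It assumes each node's snapshot names have been saved to a file, one per line, and that the CT has a single volume. The function name and file names are mine, not anything Proxmox ships.

```shell
# Print snapshot names (the part after '@') that appear in BOTH listings.
# $1 and $2 are files with one full snapshot name per line, produced for example by:
#   zfs list -H -o name -t snapshot -r <dataset> > source.txt              (on the source)
#   ssh <target> zfs list -H -o name -t snapshot -r <dataset> > target.txt
common_snapshots() {
    src_sorted=$(mktemp)
    tgt_sorted=$(mktemp)
    # Strip the dataset path so differing pool names do not matter.
    sed 's/^[^@]*@//' "$1" | sort -u > "$src_sorted"
    sed 's/^[^@]*@//' "$2" | sort -u > "$tgt_sorted"
    # comm -12 keeps only the lines common to both sorted files.
    comm -12 "$src_sorted" "$tgt_sorted"
    rm -f "$src_sorted" "$tgt_sorted"
}
```

Empty output from common_snapshots source.txt target.txt means the two sides share no snapshot at all, which matches what the replication job was reporting.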

What I changed

I treated this as a controlled reset of replication state for that CT, not as a mystery to hack around on a live filesystem.

1. Delete the broken replication job

Removing the job stops Proxmox from firing failing attempts while I clean up. Job ids show up in pvesr list.

pvesr delete 110-1

Use your real job id.

2. Remove the problematic snapshots

This is the step to treat with care. I only destroyed snapshots that were clearly tied to the stuck replication chain for that CT’s volume, not random auto-snapshots I still wanted for other reasons.

zfs destroy pool/subvol@snapshotname

If ZFS complains about dependents, list children with zfs list -r and remove snapshots in dependency order, or use the recursive destroy flags only when you understand what else lives under that dataset. In a homelab with a single CT volume, the blast radius is usually small. In production you would snapshot the whole pool or take a backup before destroying anything.
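To keep the cleanup narrow, it helps to select only the snapshots that belong to the stuck job and preview the destroy first. The sketch below assumes Proxmox names replication snapshots __replicate_<jobid>_<timestamp>__, which is worth confirming against your own zfs list output. The function name and the dataset in the usage comment are hypothetical.

```shell
# Read snapshot names on stdin and keep only the replication snapshots of one job.
# Assumption: replication snapshots are named __replicate_<jobid>_<timestamp>__.
# Usage: ... | replication_snapshots_for_job <job-id>
replication_snapshots_for_job() {
    # -F matches the string literally; the trailing underscore keeps
    # job 110-1 from also matching 110-10.
    grep -F "@__replicate_${1}_" || true
}

# Preview only: -n is a dry run, -v prints what would be destroyed.
# zfs list -H -o name -t snapshot -r tank/subvol-110-disk-0 \
#     | replication_snapshots_for_job 110-1 \
#     | while read -r snap; do zfs destroy -nv "$snap"; done
```

Dropping the -n turns the preview into the real destroy, so it is worth reading the dry-run output before doing that.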

3. Recreate the replication job

I recreated a local replication job with the same semantics as before: same CT, same target node, same fifteen-minute cadence.

pvesr create-local-job 110-1 node02 --schedule '*/15'

The job id (110-1 here) is the replication job id, which has the form <VMID>-<job number>; it is not just the VMID. Reuse the id the job had before. Replace node02 with your target cluster node name. The schedule is a systemd-style calendar event, not a cron line, and */15 means every fifteen minutes. For anything more elaborate, copy the schedule from another working job or check the docs for your release.

4. Run it once on demand

After the job existed again, I triggered a manual sync:

pvesr schedule-now 110-1

The first successful run after a reset may take longer because it is effectively establishing a new baseline. I watched the task log until it finished without the “no common base snapshot” error.
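Steps 3 and 4 can be wrapped in a small helper so the job id is checked before anything runs. This is a sketch rather than a tool: the function names and the DRY_RUN variable are my own, and by default it only prints the pvesr commands instead of executing them.

```shell
# Print commands by default; set DRY_RUN=0 to actually run them.
run_or_print() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: recreate_and_sync <job-id> <target-node> [schedule]
recreate_and_sync() {
    job_id=$1
    target=$2
    schedule=${3:-"*/15"}
    # Job ids have the form <VMID>-<job number>, for example 110-1.
    if ! printf '%s\n' "$job_id" | grep -Eq '^[0-9]+-[0-9]+$'; then
        echo "invalid job id: $job_id" >&2
        return 1
    fi
    run_or_print pvesr create-local-job "$job_id" "$target" --schedule "$schedule" || return 1
    run_or_print pvesr schedule-now "$job_id"
}

recreate_and_sync 110-1 node02
```

Snapshot cleanup is deliberately left out of the helper; that step deserves a human looking at each name.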

What I verified

  • pvesr list showed the job enabled with the expected schedule.
  • A manual pvesr schedule-now completed successfully.
  • Subsequent scheduled runs stayed green; snapshot lists on source and target looked coherent on a quick spot check.
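For the ongoing check, pvesr status reports per-job state. The filter below is a rough sketch that assumes the output starts with a header row containing a FailCount column and that the job id is the first field on each row. Verify that against your release before relying on it; the function name is hypothetical.

```shell
# Read `pvesr status` output on stdin and print the ids of jobs with failures.
# Assumptions: first line is a header with a "FailCount" column,
# and the job id is the first whitespace-separated field of each row.
failing_jobs() {
    awk '
        NR == 1 {
            # Locate the FailCount column by name instead of hard-coding its position.
            for (i = 1; i <= NF; i++) if ($i == "FailCount") col = i
            next
        }
        col && $col + 0 > 0 { print $1 }
    '
}

# Usage on a cluster node:
# pvesr status | failing_jobs
```

No output means no job is currently counting failures, under the assumptions above.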

I did not need to rebuild the CT or restore from Proxmox Backup Server for this class of failure. The data on the source was always fine; only the replication relationship was confused.

What I would do differently next time

  • Treat replication snapshots as part of the job contract. If I delete ZFS snapshots by hand, I will assume replication for that volume may need a reset unless I know which snapshot the job currently uses as its base.
  • Keep a one-line note per critical CT with the replication job id and schedule. That saves guessing during an incident.
  • Fix the job before thrashing storage. Running a broken job repeatedly does not help; clearing the job and snapshots deliberately does.

Proxmox and ZFS give you a lot of safety rails. No common base snapshot is one of them. It is annoying when you are tired, but it is preferable to silently corrupting a replica. After deleting the job, cleaning the snapshot chain, and recreating the job, replication for that Pi-hole CT was boring again, which was exactly the goal.
