
Integrating Proxmox Backup Server with the Cluster: Decisions and Troubleshooting

By Victor Da Luz
proxmox backup pbs homelab infrastructure nfs monitoring

I needed to move beyond local snapshots piling up on cluster nodes and integrate proper backup infrastructure. The Proxmox cluster was creating snapshots locally, but there was no centralized backup strategy or deduplication. Proxmox Backup Server (PBS) offered centralized, deduplicated backup storage that would integrate directly with the cluster.

This post covers the rationale for choosing PBS, the NFS storage architecture, and the troubleshooting of the storage disconnection issue that ultimately drove the monitoring alerts.

The backup problem

Local snapshots were accumulating on cluster node disks without any centralized management. Each node was storing snapshots independently, which meant no deduplication, limited retention control, and snapshots consuming valuable local storage space.

I needed a centralized backup solution that would integrate with Proxmox. The solution needed to support deduplication to maximize storage efficiency, provide direct Proxmox integration, and work within the existing infrastructure constraints.

Proxmox Backup Server was the natural choice. It’s designed specifically for Proxmox, supports efficient deduplication, integrates directly with the Proxmox datastore system, and can run as an LXC container on the cluster itself.

Architecture decision: PBS with NFS storage

I deployed PBS as an LXC container on node02. This kept the backup infrastructure within the cluster while maintaining separation from the nodes being backed up. The container runs PBS and handles the backup coordination.

Storage came from the NAS via NFS mount. The NAS already provided reliable, redundant storage, so mounting an NFS export made sense for persistent storage. PBS would handle deduplication and backup management, while the NAS provided the underlying storage capacity.

This approach has a known performance trade-off. PBS generates millions of small files for its chunk-store format, which requires high random IOPS and low latency for operations like garbage collection and pruning. NFS introduces latency overhead for these metadata-heavy operations, making maintenance jobs and restores slower than they would be on local SSD or block storage like iSCSI. For a homelab environment with lower backup frequency, this trade-off is acceptable, but production deployments should consider faster storage backends.

The architecture does separate concerns: PBS handles backup logic, deduplication, and Proxmox integration, while the NAS provides persistent storage. If PBS needs to be rebuilt or moved, the backup data remains on the NAS, accessible to a new PBS instance.

The integration required adding the PBS datastore to Proxmox. Once PBS was running, I added it as a storage target using the pvesm add pbs command, providing credentials and the datastore name. Proxmox immediately recognized it as a backup target, allowing scheduled backups and manual snapshots.
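As a sketch of that step (the server name, datastore, and credentials here are placeholders, not the real values):

    # Register PBS as a Proxmox storage target (placeholder values)
    pvesm add pbs pbs-nas \
        --server pbs.lan \
        --datastore backups \
        --username backup@pbs \
        --password 'REDACTED' \
        --fingerprint 'AA:BB:CC:...:FF'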

A test backup confirmed the integration worked. Running a manual snapshot to the PBS storage showed successful backup completion and verified that Proxmox could write backups to PBS, which would then store them on the NAS.
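A one-off vzdump run is enough to exercise the whole path; something like this, with a hypothetical container ID:

    # Manual snapshot-mode backup of container 101 to the PBS storage
    vzdump 101 --storage pbs-nas --mode snapshot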

Troubleshooting: Storage disconnection

After restarting the PBS container, the pbs-nas storage showed as disconnected in Proxmox. The web interface indicated the storage was unavailable, which would prevent backups from running.

The issue was that the NFS mount wasn’t automatically remounted after the container restart. The mount point at /mnt/pbs-backups existed, but the NFS export wasn’t mounted. A quick mount -a fixed it immediately, but this was a problem that needed a permanent solution.
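If you hit the same symptom, confirming the state before remounting takes seconds (paths as configured in this setup):

    # Check whether anything is actually mounted at the mount point
    findmnt /mnt/pbs-backups || echo "pbs-backups not mounted"

    # Remount everything from fstab as a stopgap
    mount -a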

The mount was initially configured with the _netdev option in fstab. However, in LXC containers, _netdev is often ineffective because the container’s virtual network interface appears “up” almost instantly during boot, before full network reachability (DNS/routing) is actually established. This race condition causes the mount to fail silently during startup.

The solution was to use x-systemd.automount in the fstab entry. This creates an on-demand mount point that mounts upon first access, bypassing the boot-time network race condition entirely. The mount configuration uses both x-systemd.automount and _netdev options to ensure proper network-aware mounting behavior.
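The resulting fstab line looks roughly like this, with the NAS host and export path as placeholders:

    # /etc/fstab -- NFS export for PBS backups (placeholder host and paths)
    nas.lan:/volume1/pbs-backups  /mnt/pbs-backups  nfs  _netdev,x-systemd.automount,noatime  0  0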

I needed visibility into when this happened. If the mount failed silently, backups would fail and there would be no notification until someone checked the Proxmox storage status. This wasn’t acceptable for backup infrastructure.

Implementing monitoring alerts

Monitoring the mount status raises a challenge specific to LXC containers. Standard Proxmox monitoring stacks run node_exporter on the Proxmox host, which doesn't have visibility into the mount namespaces of LXC containers. The host's node_exporter can't see mounts inside container filesystems.

The solution requires mounting the NFS share on the Proxmox host and bind-mounting it into the LXC container. This approach ensures the host’s node_exporter can monitor the mount status while the container still has access to the storage. Alternatively, you could install a separate node_exporter instance inside the LXC, but bind-mounting from the host is cleaner and provides better integration with the existing monitoring stack.
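A minimal sketch of the bind mount, assuming the PBS container has ID 105 and uses the same path inside and outside:

    # On the Proxmox host, after the NFS export is mounted at /mnt/pbs-backups:
    # expose it to the PBS container as mount point mp0
    pct set 105 -mp0 /mnt/pbs-backups,mp=/mnt/pbs-backups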

I added Prometheus alert rules to monitor the PBS NFS mount status on the host. The alerts use node_exporter filesystem metrics to detect when the mount is missing or read-only.

Two critical alerts were created:

  • PBSNFSMountMissing: Triggers when the NFS mount isn't available on the host. It keys off the node_filesystem_size_bytes metric: if the mount point doesn't appear in the metrics, the mount is missing.
  • PBSNFSMountReadOnly: Triggers when the mount becomes read-only. It keys off the node_filesystem_readonly metric, since a read-only filesystem would still allow reads but block backup writes.

Both alerts have a 2-minute threshold to avoid false positives during normal operations. The alerts deploy via the Prometheus-Grafana Ansible playbook, keeping monitoring configuration in the infrastructure-as-code repository.
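A sketch of the two rules, with the mount point and severity labels assumed from this setup rather than copied from the playbook:

    groups:
      - name: pbs-storage
        rules:
          # Fires when no filesystem metrics exist for the mount point,
          # i.e. the NFS export is not mounted on the host
          - alert: PBSNFSMountMissing
            expr: absent(node_filesystem_size_bytes{mountpoint="/mnt/pbs-backups"})
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "PBS NFS mount {{ $labels.mountpoint }} is missing"
          # Fires when the mount is present but read-only, which blocks backup writes
          - alert: PBSNFSMountReadOnly
            expr: node_filesystem_readonly{mountpoint="/mnt/pbs-backups"} == 1
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "PBS NFS mount read-only on {{ $labels.instance }}"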

This provides immediate notification when backup storage becomes unavailable. Instead of discovering failed backups later, alerts trigger as soon as the mount disconnects, allowing quick remediation.

Scheduling and retention strategy

Once PBS integration was stable, I configured scheduled backups. Following Proxmox backup best practices, I created a nightly backup job that snapshots service containers to PBS with compression and retention policies.

The backup schedule uses snapshot mode with zstd compression. This provides fast backups without stopping containers, and zstd compression balances speed and storage efficiency.

Retention policy follows a standard approach: Keep the last 7 daily backups, 4 weekly backups, and 6 monthly backups. This provides recent recovery points while maintaining longer-term history without excessive storage consumption. PBS's deduplication keeps these retention policies storage-efficient.
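The scheduled job is configured through the Proxmox UI, but the equivalent one-shot command shows all the moving parts (container IDs here are hypothetical):

    # Snapshot-mode backup with zstd compression and 7/4/6 retention
    vzdump 101 102 103 \
        --storage pbs-nas \
        --mode snapshot \
        --compress zstd \
        --prune-backups keep-daily=7,keep-weekly=4,keep-monthly=6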

PBS-side maintenance jobs complement the backup schedule. A prune job runs in the early morning to clean up old backups according to the retention policy, followed by garbage collection to reclaim space from chunks that are no longer referenced. A verification job runs nightly to check backup integrity.
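The schedules themselves live in the PBS datastore configuration, but garbage collection can also be run and inspected by hand (datastore name assumed):

    # Kick off garbage collection on the 'backups' datastore and check on it
    proxmox-backup-manager garbage-collection start backups
    proxmox-backup-manager garbage-collection status backups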

This creates a complete backup workflow: Nightly backups from Proxmox, automatic pruning and garbage collection on PBS, and integrity verification to ensure backups remain recoverable.

Real-world validation

The backup system has already proven its value in practice. I've had to restore containers from PBS backups several times after configuration mistakes or accidental changes broke services. Each restore has worked: the container comes back exactly as it was at the backup point, with all configuration and data intact.

The restore process is straightforward: Select the backup point in the Proxmox interface, choose the restore options, and the container is restored from PBS. Restores have been fast enough in practice, and the nightly verification jobs ensure the backups are actually recoverable.
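The same restore can be driven from the CLI; a sketch with a hypothetical container ID, snapshot timestamp, and target storage:

    # Restore container 101 from a PBS backup snapshot to local storage
    pct restore 101 pbs-nas:backup/ct/101/2024-01-15T02:00:00Z --storage local-lvm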

This real-world validation confirms the backup strategy is working as intended. Having successfully restored multiple times demonstrates that the backups are reliable and the restore process works when needed. It’s the difference between hoping backups work and knowing they do.

Monitoring considerations

I configured Uptime Kuma to monitor PBS availability. The monitor checks PBS directly via the .lan domain rather than through Traefik, ensuring that Traefik outages don’t show PBS as down. This provides accurate monitoring of the backup infrastructure independent of the reverse proxy.

Traefik integration provides HTTPS access to the PBS web UI. The UI is accessible via the wildcard certificate at pbs.vdaluz.net for remote management, while internal monitoring uses direct .lan access.

The monitoring setup ensures backup infrastructure issues are visible immediately. Between Prometheus alerts for storage disconnection and Uptime Kuma for service availability, there’s comprehensive visibility into backup infrastructure health.

What’s next

The PBS integration is complete and backups are running on schedule. The storage disconnection monitoring provides early warning when mounts fail, and the scheduled backups ensure regular snapshots with appropriate retention.

Future improvements could include: Backup verification automation, cross-node backup testing, and expanding coverage to VM backups in addition to containers. The foundation is solid, and the infrastructure can scale as backup requirements grow.

The key lesson is that backup infrastructure needs monitoring just like the services being backed up. Storage disconnections can happen silently, and without monitoring, you only discover the problem when you need to restore. Alerting on mount status provides the visibility needed to maintain backup reliability.
