Making the PBS NFS mount self-healing: auto-recovery with a systemd timer
I was mid-way through rebalancing containers across the Proxmox cluster when I noticed the Proxmox Backup Server datastore was showing as inactive. Not degraded, not slow - inactive. PBS couldn’t see its storage.
This is the kind of failure you want to catch before you need to restore something.
The problem
The PBS container uses an NFS mount to access a share on the NAS. All the backup chunks live there. When the mount isn’t healthy, the nas-backups datastore goes inactive and backups fail.
The mount was configured in fstab and had been working. Something - probably a network hiccup during the container migration work - had caused it to drop, and it hadn’t recovered.
mount | grep pbs
The NFS share wasn’t listed. The mount point existed, but a local filesystem was mounted there instead of the NFS export. Backups had been silently failing.
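To tell an NFS mount apart from whatever else might be sitting on the mount point, it helps to look at the filesystem type rather than just the path. A minimal sketch, assuming the standard Linux /proc/self/mounts table; the function names mount_fstype and is_nfs_mounted are made up for illustration and are not part of the original setup:

```shell
# Print the filesystem type(s) mounted at a mount point, one per line.
# $1 = mount point, $2 = mounts table to read (default: /proc/self/mounts)
mount_fstype() {
    awk -v mp="$1" '$2 == mp { print $3 }' "${2:-/proc/self/mounts}"
}

# Succeed only if an NFS filesystem (nfs or nfs4) is mounted at the path.
is_nfs_mounted() {
    mount_fstype "$1" "${2:-/proc/self/mounts}" | grep -Eq '^nfs4?$'
}

# Example:
#   is_nfs_mounted /mnt/pbs-backups || echo "no NFS mount at /mnt/pbs-backups"
```

With x-systemd.automount in place, the table may list an autofs entry at the same path alongside the NFS one, which is why the check looks for any matching line rather than exactly one.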
The manual fix was simple:
umount -l /mnt/pbs-backups
mount /mnt/pbs-backups
The datastore went active immediately. But “manually fix it when I notice” is not a backup strategy.
Why NFS mounts need explicit resilience
The default mount behavior is fire-and-forget. If the mount succeeds at boot, great. If something causes it to drop later - NAS reboot, network interruption, a brief link flap - the mount stays gone until someone intervenes or the system reboots.
In an LXC container, there’s also a race condition at boot: the container’s virtual network interface comes up almost instantly, but that doesn’t mean routing and DNS are actually ready. An NFS mount configured with _netdev can fail silently because the network is reported as up before the NAS is actually reachable.
What I needed:
- Handle stale mount points without getting stuck in a blocking state
- Verify the NFS share is actually reachable before trying to mount
- Retry on failure rather than giving up on the first error
- Run on a schedule to catch drops that happen while the system is already up
The solution
Three pieces: hardened fstab options, an auto-remount script, and a systemd timer that checks every hour.
Fstab options
nas.internal:/share/pbs-backups /mnt/pbs-backups nfs rw,vers=3,x-systemd.automount,_netdev,soft,timeo=30,retrans=3,noresvport 0 0
What these add over a default NFS entry:
- x-systemd.automount - defers the actual mount until first access, which sidesteps the LXC boot race condition described above: the kernel marks the network up before routing and DNS are ready, but the mount only happens when something first touches the path
- soft - instead of hanging indefinitely when the NFS server is unreachable, operations fail with an error after the timeout. This prevents mount operations from blocking the container.
- timeo=30 - timeout in tenths of a second (3 seconds) before retrying an NFS request
- retrans=3 - retransmissions before the client gives up and returns an error
- noresvport - don’t require a privileged source port; avoids firewall issues on reconnect
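Once the share is mounted, it is worth confirming that the kernel actually picked up these options. A rough sketch, again reading /proc/self/mounts; mount_options and mount_has_option are hypothetical helper names, and the exact option strings the kernel reports can differ slightly between kernel versions:

```shell
# Print the option string the kernel reports for an NFS mount point.
# $1 = mount point, $2 = mounts table to read (default: /proc/self/mounts)
mount_options() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { print $4 }' "${2:-/proc/self/mounts}"
}

# Succeed if the given option (e.g. "soft" or "timeo=30") is in effect.
# $1 = mount point, $2 = option, $3 = mounts table (optional)
mount_has_option() {
    mount_options "$1" "${3:-/proc/self/mounts}" | tr ',' '\n' | grep -Fxq -- "$2"
}

# Example:
#   for opt in soft timeo=30 retrans=3; do
#       mount_has_option /mnt/pbs-backups "$opt" || echo "missing option: $opt"
#   done
```

Options that are handled in userspace, such as _netdev and x-systemd.automount, are not expected to show up in this table, so the example only checks the NFS client options.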
Auto-remount script
The script lives at /usr/local/bin/pbs-nas-remount.sh:
#!/bin/bash

MOUNT_POINT="/mnt/pbs-backups"
NAS_HOST="nas.internal"
MAX_RETRIES=3
RETRY_DELAY=5

# Check network connectivity first
if ! ping -c 1 -W 3 "$NAS_HOST" > /dev/null 2>&1; then
    echo "NAS unreachable, skipping remount"
    exit 1
fi

# Check if already mounted and healthy
if mountpoint -q "$MOUNT_POINT" && [ -w "$MOUNT_POINT" ]; then
    exit 0
fi

# Attempt remount with retries
for i in $(seq 1 $MAX_RETRIES); do
    umount -l "$MOUNT_POINT" 2>/dev/null
    if mount "$MOUNT_POINT" 2>/dev/null; then
        if mountpoint -q "$MOUNT_POINT" && [ -w "$MOUNT_POINT" ]; then
            echo "Remount successful"
            exit 0
        fi
    fi
    sleep $RETRY_DELAY
done

echo "Remount failed after $MAX_RETRIES attempts"
exit 1
The network check happens first. If the NAS isn’t reachable, the script exits early with a non-zero status rather than accumulating failed mount attempts. After each mount attempt, mountpoint -q confirms that something is actually mounted at the path and [ -w ] confirms it is writable; the mount command’s exit code alone can’t be trusted. (NFS mounts with nofail can return 0 on failure; always check the mount table.)
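One of the requirements was not getting stuck on a stale mount point. The soft option already bounds how long NFS operations wait, but a health check can add its own upper limit as well. A small sketch using the coreutils timeout command; path_responds is a hypothetical helper, not something the script above contains:

```shell
# Succeed if stat on the path completes within the time limit.
# $1 = path, $2 = limit in seconds (default: 5)
path_responds() {
    local path="$1"
    local secs="${2:-5}"
    # timeout exits with 124 if stat had to be stopped, otherwise with
    # stat's own status (non-zero for a missing or stale path).
    timeout "$secs" stat -- "$path" > /dev/null 2>&1
}

# Example:
#   path_responds /mnt/pbs-backups 5 || echo "mount point is not responding"
```

A check like this could run before the [ -w ] test so that a stale handle produces a quick failure instead of a long wait.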
Systemd timer
/etc/systemd/system/pbs-nas-remount.service:
[Unit]
Description=PBS NAS NFS Auto-Remount
Wants=network-online.target
After=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/pbs-nas-remount.sh
/etc/systemd/system/pbs-nas-remount.timer:
[Unit]
Description=PBS NAS NFS Remount Timer
[Timer]
OnBootSec=5min
OnUnitActiveSec=1h
[Install]
WantedBy=timers.target
OnBootSec=5min gives the container time to finish booting and establish full network connectivity before the first check. After that, OnUnitActiveSec=1h runs it every hour - frequent enough to catch a dropped mount, infrequent enough to let the NAS spin down between checks.
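For a manual install (without Ansible), the units still need to be loaded and the timer enabled. A sketch of that step, wrapped in a function; the SYSTEMCTL override is a hypothetical convenience for dry runs and is not part of the original setup:

```shell
# Reload unit files and enable the timer so it starts now and on every boot.
# Set SYSTEMCTL=echo to print the commands instead of running them.
enable_remount_timer() {
    local systemctl_cmd="${SYSTEMCTL:-systemctl}"
    "$systemctl_cmd" daemon-reload &&
        "$systemctl_cmd" enable --now pbs-nas-remount.timer
}

# Example (as root):
#   enable_remount_timer
```

Only the timer is enabled. The service has no [Install] section and is started by the timer when it fires.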
Ansible automation
All of this deploys via the proxmox-backup-server Ansible role: fstab entry, script, systemd units, timer enabled. Any new PBS container gets the full resilience stack automatically. (The Ansible role splits the mount differently - the host mounts the NFS share at /mnt/pbs-backups and the PBS container sees it at /mnt/datastores/pbs-nas via a bind mount, rather than the container mounting NFS directly.)
Verification
After deploying:
systemctl status pbs-nas-remount.timer
systemctl list-timers pbs-nas-remount.timer
Timer active, next trigger in ~1 hour. The nas-backups datastore showed active in the PBS web UI on both nodes.
I tested recovery by unmounting manually and waiting. Within the hour the script ran, verified NAS reachability, remounted, and the datastore went active again without any manual intervention.
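A recovery drill like this does not have to wait for the timer: the service can be started by hand, and a small polling loop can report when the mount is back. A sketch under those assumptions; wait_for_mount is a hypothetical helper, and mountpoint -q only confirms that something is mounted at the path:

```shell
# Poll once per second until the path is a mount point or the limit expires.
# $1 = path, $2 = limit in seconds. Succeeds if the path ends up mounted.
wait_for_mount() {
    local path="$1"
    local limit="$2"
    local waited=0
    while [ "$waited" -lt "$limit" ]; do
        if mountpoint -q "$path" 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Final check once the limit has been reached.
    mountpoint -q "$path" 2>/dev/null
}

# Example drill (as root on the PBS container):
#   umount -l /mnt/pbs-backups
#   systemctl start pbs-nas-remount.service   # run the check now
#   wait_for_mount /mnt/pbs-backups 60 && echo "recovered"
```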
Lessons
Silent mount failures are a backup reliability problem. If PBS loses its datastore and nobody notices, backups fail quietly. By the time you need to restore, the last successful backup could be days old.
Verify with mountpoint -q, not the exit code. NFS (and CIFS) mounts can return success even when the mount didn’t happen, especially with nofail. mountpoint -q checks the kernel’s mount table and isn’t fooled.
The systemd timer pattern is more reliable than cron for post-boot work. With cron, a system restart can mean the job doesn’t run until the next scheduled slot. OnBootSec in a systemd timer fires reliably after boot, with a delay you control.
Network check before mount attempt. Trying to mount when the NAS isn’t reachable just accumulates errors. A ping check first keeps the logs clean and prevents the mount state from getting stuck.
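Ping only shows that the host answers ICMP. A slightly stricter variant checks that the NFS port itself accepts connections. This is a sketch, assuming the NAS serves NFS on the standard TCP port 2049 and that bash and coreutils timeout are available; tcp_port_open is a hypothetical helper, and an NFSv3 server also depends on rpcbind and mountd, which this does not test:

```shell
# Succeed if a TCP connection to host:port opens within the time limit.
# $1 = host, $2 = port, $3 = limit in seconds (default: 3)
tcp_port_open() {
    local host="$1"
    local port="$2"
    local secs="${3:-3}"
    # bash treats /dev/tcp/HOST/PORT in a redirection as a TCP connect.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example:
#   tcp_port_open nas.internal 2049 || echo "NFS port not reachable"
```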
The PBS initial setup and NFS storage architecture are covered in Integrating Proxmox Backup Server with the cluster.
Related reading
Integrating Proxmox Backup Server with the Cluster: Decisions and Troubleshooting
Decision rationale for PBS integration, NFS storage architecture, troubleshooting storage disconnection issues, and implementing monitoring alerts for backup reliability.
Setting Up Gitea as a GitHub Backup
Why I mirror GitHub into self-hosted Gitea on Proxmox, how Ansible and Docker-in-LXC fit together, and the gotchas that showed up along the way.
Consolidating audiobooks and ebooks into a single Audiobookshelf
I was running two media servers, Audiobookshelf for audiobooks and Kavita for ebooks, when one could do both. Rebuilding the homelab in v3 was the excuse to merge them: one Ansible-deployed Audiobookshelf, local-disk storage, and a USB-drive ZFS scare in the middle of the migration.