Infrastructure

Making the PBS NFS mount self-healing: auto-recovery with a systemd timer

By Victor Da Luz
proxmox backup pbs homelab nfs systemd ansible infrastructure

I was mid-way through rebalancing containers across the Proxmox cluster when I noticed the Proxmox Backup Server datastore was showing as inactive. Not degraded, not slow - inactive. PBS couldn’t see its storage.

This is the kind of failure you want to catch before you need to restore something.

The problem

The PBS container uses an NFS mount to access a share on the NAS. All the backup chunks live there. When the mount isn’t healthy, the nas-backups datastore goes inactive and backups fail.

The mount was configured in fstab and had been working. Something - probably a network hiccup during the container migration work - had caused it to drop, and it hadn’t recovered.

mount | grep pbs

The NFS share wasn’t listed. The mount point existed, but a local filesystem was mounted there instead of the NFS export. Backups had been silently failing.
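Grepping the mount output works interactively, but a script needs a stricter check. The sketch below reads a mounts table in /proc/self/mounts format and reports the filesystem type of whatever is mounted on top at a given path. The helper names mount_fstype and is_nfs_mounted, and the optional table-file argument, are assumptions made for illustration and are not part of the original setup. Checking the type matters because a path can be a mount point without being the NFS export, for example when an automount placeholder is present but the NFS layer is not.

```shell
#!/bin/bash
# Sketch: report whether a mount point is currently backed by NFS.
# mount_fstype / is_nfs_mounted are hypothetical helper names.

# Print the filesystem type of the last (topmost) entry mounted at $1.
# $2 is a mounts table in /proc/self/mounts format (default: the live table).
mount_fstype() {
    local mount_point="$1" table="${2:-/proc/self/mounts}"
    awk -v mp="$mount_point" \
        '$2 == mp { fstype = $3 } END { if (fstype != "") print fstype }' \
        "$table"
}

# Succeed only if the topmost mount at $1 is of type nfs or nfs4.
is_nfs_mounted() {
    case "$(mount_fstype "$@")" in
        nfs|nfs4) return 0 ;;
        *)        return 1 ;;
    esac
}
```

Calling is_nfs_mounted /mnt/pbs-backups returns non-zero both when nothing is mounted there and when something other than NFS is.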

Manual fix was simple:

umount -l /mnt/pbs-backups
mount /mnt/pbs-backups

The datastore went active immediately. But “manually fix it when I notice” is not a backup strategy.

Why NFS mounts need explicit resilience

The default mount behavior is fire-and-forget. If the mount succeeds at boot, great. If something causes it to drop later - NAS reboot, network interruption, a brief link flap - the mount stays gone until someone intervenes or the system reboots.

In an LXC container, there’s also a race condition at boot: the container’s virtual network interface comes up almost instantly, but that doesn’t mean routing and DNS are actually ready. An NFS mount configured with _netdev can fail silently because the interface is reported as up before the NAS is actually reachable.

What I needed:

  • Handle stale mount points without getting stuck in a blocking state
  • Verify the NFS share is actually reachable before trying to mount
  • Retry on failure rather than giving up on the first error
  • Run on a schedule to catch drops that happen while the system is already up

The solution

Three pieces: hardened fstab options, an auto-remount script, and a systemd timer that checks every hour.

Fstab options

nas.internal:/share/pbs-backups /mnt/pbs-backups nfs rw,vers=3,x-systemd.automount,_netdev,soft,timeo=30,retrans=3,noresvport 0 0

What these add over a default NFS entry:

  • x-systemd.automount - defers the actual mount until first access, which sidesteps the LXC boot race condition described above: the kernel marks the network up before routing and DNS are ready, but the mount only happens when something first touches the path
  • soft - instead of hanging indefinitely when the NFS server is unreachable, operations fail with an error once the retries are exhausted. This keeps a dead NAS from blocking processes inside the container.
  • timeo=30 - timeout in tenths of a second (3 seconds) before retrying an NFS request
  • retrans=3 - number of retransmissions before the client gives up and returns an error
  • noresvport - connect from a non-privileged source port instead of a reserved one (below 1024); avoids firewall issues on reconnect, provided the NAS export accepts non-privileged ports
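An fstab entry only describes the intended options, so it can be worth confirming that the live mount actually carries them. The sketch below checks a comma-separated option string for a set of expected options. The helper name has_mount_options is an assumption for illustration. Options handled outside the kernel, such as x-systemd.automount, may not show up in the live option list, so the usage example checks only the NFS client options.

```shell
#!/bin/bash
# Sketch: confirm a comma-separated mount option string contains every
# expected option. has_mount_options is a hypothetical helper name.
has_mount_options() {
    local actual=",$1," opt   # wrap in commas so each option matches as a whole word
    shift
    for opt in "$@"; do
        case "$actual" in
            *",$opt,"*) ;;    # option present, keep checking
            *) echo "missing option: $opt" >&2; return 1 ;;
        esac
    done
    return 0
}
```

One way to feed it the live options, assuming findmnt from util-linux is installed: has_mount_options "$(findmnt -t nfs,nfs4 -no OPTIONS /mnt/pbs-backups)" soft timeo=30 retrans=3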

Auto-remount script

The script lives at /usr/local/bin/pbs-nas-remount.sh:

#!/bin/bash
MOUNT_POINT="/mnt/pbs-backups"
NAS_HOST="nas.internal"
MAX_RETRIES=3
RETRY_DELAY=5

# Check network connectivity first
if ! ping -c 1 -W 3 "$NAS_HOST" > /dev/null 2>&1; then
    echo "NAS unreachable, skipping remount"
    exit 1
fi

# Check if already mounted and healthy
if mountpoint -q "$MOUNT_POINT" && [ -w "$MOUNT_POINT" ]; then
    exit 0
fi

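# Attempt remount with retries
for i in $(seq 1 "$MAX_RETRIES"); do
    # Lazy-unmount any stale mount first; ignore the error if nothing is mounted
    umount -l "$MOUNT_POINT" 2>/dev/null
    if mount "$MOUNT_POINT" 2>/dev/null; then
        # Trust the mount table and a write check, not mount's exit code
        if mountpoint -q "$MOUNT_POINT" && [ -w "$MOUNT_POINT" ]; then
            echo "Remount successful"
            exit 0
        fi
    fi
    sleep "$RETRY_DELAY"
done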

echo "Remount failed after $MAX_RETRIES attempts"
exit 1

The network check happens first. If the NAS isn’t reachable, the script exits early with a non-zero status rather than accumulating failed mount attempts. After each mount attempt, mountpoint -q confirms the path really is a mount point and the [ -w ] test confirms it is writable; the mount command’s exit code alone can’t be trusted. (NFS mounts with nofail can return 0 on failure; always check the mount table.)
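One of the goals was handling stale mount points without getting stuck in a blocking state. The soft option bounds how long an NFS operation can hang, but a health probe can also carry its own time budget. The sketch below wraps the write check in timeout from coreutils. The names mount_is_healthy and CHECK_TIMEOUT are assumptions for illustration, and the function only probes writability, so in the script it would sit alongside the mountpoint -q check rather than replace it.

```shell
#!/bin/bash
# Sketch: a write probe that cannot block longer than a fixed budget.
# mount_is_healthy and CHECK_TIMEOUT are hypothetical names.
CHECK_TIMEOUT="${CHECK_TIMEOUT:-5}"

mount_is_healthy() {
    local path="$1"
    # timeout sends TERM after CHECK_TIMEOUT seconds and KILL 2 seconds
    # later, so a hung probe is abandoned instead of stalling the caller.
    # A timed-out or failed probe yields a non-zero exit status.
    timeout -k 2 "$CHECK_TIMEOUT" test -w "$path"
}
```

A call such as mount_is_healthy /mnt/pbs-backups then returns non-zero for a missing path, a read-only path, or a probe that exceeded the budget.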

Systemd timer

/etc/systemd/system/pbs-nas-remount.service:

[Unit]
Description=PBS NAS NFS Auto-Remount
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/pbs-nas-remount.sh

/etc/systemd/system/pbs-nas-remount.timer:

[Unit]
Description=PBS NAS NFS Remount Timer

[Timer]
OnBootSec=5min
OnUnitActiveSec=1h

[Install]
WantedBy=timers.target

OnBootSec=5min gives the container time to finish booting and establish full network connectivity before the first check. After that, OnUnitActiveSec=1h runs it every hour - frequent enough to catch a dropped mount, infrequent enough to let the NAS spin down between checks.
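Unit files do nothing until systemd has loaded them and the timer is enabled. The sketch below shows one way to activate the timer by hand once the files are in place. The function names run and enable_remount_timer and the DRY_RUN variable are assumptions for illustration; setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/bin/bash
# Sketch: activate the timer after the unit files are in place.
# run, enable_remount_timer, and DRY_RUN are hypothetical names.

# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

enable_remount_timer() {
    run systemctl daemon-reload                        # pick up new or changed unit files
    run systemctl enable --now pbs-nas-remount.timer   # start now and on every boot
}
```

For a one-off check that doesn't wait for the timer, systemctl start pbs-nas-remount.service runs the script immediately, and journalctl -u pbs-nas-remount.service shows its output.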

Ansible automation

All of this deploys via the proxmox-backup-server Ansible role: fstab entry, script, systemd units, timer enabled. Any new PBS container gets the full resilience stack automatically. (The Ansible role splits the mount differently - the host mounts the NFS share at /mnt/pbs-backups and the PBS container sees it at /mnt/datastores/pbs-nas via a bind mount, rather than the container mounting NFS directly.)

Verification

After deploying:

systemctl status pbs-nas-remount.timer
systemctl list-timers pbs-nas-remount.timer

Timer active, next trigger in ~1 hour. The nas-backups datastore showed active in the PBS web UI on both nodes.

I tested recovery by unmounting manually and waiting. Within the hour the script ran, verified NAS reachability, remounted, and the datastore went active again without any manual intervention.

Lessons

Silent mount failures are a backup reliability problem. If PBS loses its datastore and nobody notices, backups fail quietly. By the time you need to restore, the last successful backup could be days old.

Verify with mountpoint -q, not the exit code. NFS (and CIFS) mounts can return success even when the mount didn’t happen, especially with nofail. mountpoint -q checks the kernel’s mount table and isn’t fooled.

The systemd timer pattern is more reliable than cron for post-boot work. With cron, a system restart can mean the job doesn’t run until the next scheduled slot. OnBootSec in a systemd timer fires reliably after boot, with a delay you control.

Network check before mount attempt. Trying to mount when the NAS isn’t reachable just accumulates errors. A ping check first keeps the logs clean and prevents the mount state from getting stuck.


The PBS initial setup and NFS storage architecture are covered in Integrating Proxmox Backup Server with the cluster.
