Setting up Traefik high availability on Raspberry Pi with Keepalived
Traefik is the entry point for all services in my homelab, so it needs to be reliable. Running it on a single machine creates a single point of failure. If that machine goes down, nothing is accessible, even if the backend services are running fine.
I decided to build a high-availability Traefik setup using two Raspberry Pi 4B devices. The hostnames are pi3 and pi4, though both are Raspberry Pi 4B models. This gives me redundancy and automatic failover without adding complexity to the infrastructure.
This is the story of how I set up Traefik with Keepalived for automatic failover, and what I learned about making it work reliably.
Why Raspberry Pi for Traefik?
Traefik is lightweight and stateless, which makes it perfect for Raspberry Pi hardware. It doesn’t need much CPU or memory, and ARM64 support means it runs natively on Pi hardware without emulation overhead.
Both devices were already on the network with proper connectivity. pi4 had been idle since I retired my Docker Swarm. pi3 runs lightweight services like a Pi-hole replica and the Proxmox qdevice, but has spare capacity for Traefik as well.
Running Traefik on dedicated Raspberry Pi hardware keeps it independent from the Proxmox cluster. This means Traefik continues working even if I need to reboot or troubleshoot Proxmox nodes. The reverse proxy operates independently, which improves overall infrastructure resilience.
Power efficiency is a nice bonus. ARM-based Raspberry Pi devices consume less power than running additional services in Proxmox containers, which matters for 24/7 operation.
My homelab philosophy favors dedicated devices for critical infrastructure. Traefik handles all incoming traffic, so it deserves dedicated hardware with redundancy.
Choosing the high availability approach
I had two options for making Traefik highly available.
Keepalived VIP provides active/passive failover. Both nodes run Traefik, but only one serves traffic at a time through a virtual IP address. The virtual IP floats between nodes based on priority and health checks. Failover is fast, typically 1-3 seconds. There’s no load balancing, but that’s acceptable for homelab traffic levels.
Traefik clustering provides active/active operation. Both nodes serve traffic simultaneously, which sounds ideal. But it requires etcd, Consul, or Redis for shared state, it's harder to operate and troubleshoot, and it feels like overkill for homelab needs.
I chose Keepalived VIP. It’s simpler, well-documented, and matches my preference for straightforward solutions. The active/passive model works fine for homelab traffic levels, and fast failover ensures minimal downtime. Sometimes the simpler solution is the right solution.
The architecture
The virtual IP is 192.168.1.20. This matches the IP that Traefik had before, which minimizes DNS changes. The virtual IP floats between nodes based on Keepalived’s decision.
pi4 is the primary node at 192.168.1.14. It has priority 100 and operates in MASTER state. pi4 runs Traefik and Keepalived, serving all traffic when healthy.
pi3 is the backup node at 192.168.1.13. It has priority 90 and operates in BACKUP state. pi3 also runs Traefik and Keepalived, ready to take over if pi4 fails, and it continues to host the Pi-hole replica and the Proxmox qdevice.
VRRP configuration uses virtual router ID 50. Advertisement interval is set to 1 second for fast failure detection. Health checks monitor Traefik service status, and failover triggers if Traefik stops or the network fails.
Keepalived includes a health check script that monitors Traefik. If Traefik fails on the primary node, Keepalived automatically lowers the priority, causing failover to the backup. This ensures the virtual IP doesn’t stay on a node with a broken service.
Setting up Traefik
Traefik provides official ARM64 builds, so installation was straightforward. The installation script downloads the correct binary for the Raspberry Pi's architecture. I installed Traefik on both pi3 and pi4 following the same process.
I created a Traefik system user and directories on both nodes. Traefik runs as a non-root user for security, which is a best practice. The configuration files live in /etc/traefik/, and certificates are stored in /etc/traefik/certs/.
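A rough sketch of those steps, assuming the paths above; the version pinned here is only an example, not necessarily what I installed:

```bash
# Create the service user and config directories (run on both pi3 and pi4)
sudo useradd --system --no-create-home --shell /usr/sbin/nologin traefik
sudo mkdir -p /etc/traefik/certs
sudo chown -R traefik:traefik /etc/traefik

# Download the official linux/arm64 build and install the binary
TRAEFIK_VERSION=v3.1.0   # example version only
curl -fsSL "https://github.com/traefik/traefik/releases/download/${TRAEFIK_VERSION}/traefik_${TRAEFIK_VERSION}_linux_arm64.tar.gz" \
  | sudo tar -xz -C /usr/local/bin traefik
```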
Configuration synchronization is critical for HA. Both nodes need identical Traefik configurations, or services might work on one node but not the other. I solved this by storing configs in a git repository at iac/traefik/config/ and using a sync script to deploy to both nodes. This ensures both nodes pull from the same source of truth.
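A minimal sketch of the sync script's core idea, assuming SSH access to both nodes and passwordless sudo for the deploy user; my real script and its validation steps may differ:

```bash
#!/usr/bin/env bash
# Deploy the git-tracked Traefik config from iac/traefik/config/ to both nodes.
set -euo pipefail

NODES=(pi4 pi3)
SRC="iac/traefik/config/"

# Sanity-check the YAML before touching either node
yamllint "${SRC}"

for node in "${NODES[@]}"; do
  # Keep per-node certificate storage out of the sync
  rsync -av --delete --exclude 'certs/' --rsync-path="sudo rsync" "${SRC}" "${node}:/etc/traefik/"
  ssh "${node}" 'sudo systemctl restart traefik'
done
```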
Certificate storage is handled per-node. Each node maintains its own certificate storage in /etc/traefik/certs/acme.json. Traefik automatically requests certificates on both nodes when needed. This avoids the complexity of shared certificate storage while still providing redundancy.
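In the static configuration that amounts to pointing an ACME resolver at the per-node file. A sketch, with the resolver name, email, and challenge type as placeholders rather than my exact settings:

```yaml
# /etc/traefik/traefik.yml (excerpt; values are placeholders)
entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@example.com
      storage: /etc/traefik/certs/acme.json
      # Challenge type is a placeholder; a DNS challenge also works and does not
      # require the requesting node to be reachable on the VIP.
      httpChallenge:
        entryPoint: web
```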
The first challenge was port binding permissions. Traefik runs as a non-root user but needs to bind to privileged ports 80 and 443. The initial error was "listen tcp :80: bind: permission denied". I solved this by adding the CAP_NET_BIND_SERVICE capability to the systemd service file, which lets Traefik bind to these ports while staying non-root, more secure than simply running it as root.
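The relevant part of the unit file looks roughly like this, with paths assumed from the layout above:

```ini
# /etc/systemd/system/traefik.service (excerpt)
[Service]
User=traefik
Group=traefik
ExecStart=/usr/local/bin/traefik --configFile=/etc/traefik/traefik.yml
# Grant only the capability needed to bind ports 80 and 443 as a non-root user
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
```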
The port conflict problem
A more significant challenge emerged during setup. pi3 runs Pi-hole, which uses ports 80 and 443 for its admin interface. This creates a conflict with Traefik. When Traefik tries to start on pi3, it can’t bind to these ports because Pi-hole is already using them.
This breaks the HA failover model. If the backup node can’t run Traefik, failover won’t work. The virtual IP might move to pi3, but Traefik wouldn’t be running to serve traffic.
The solution was to reconfigure Pi-hole to use different ports for its web interface. Pi-hole v6 uses an internal web server managed by pihole-FTL, and the ports are configured via the webserver.port setting in /etc/pihole/pihole.toml. I changed the configuration from "80r,443s" to "8080r,8443s", moving the web interface to ports 8080 for HTTP and 8443 for HTTPS while keeping DNS on port 53.
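The change itself is a one-line edit in pihole.toml:

```toml
# /etc/pihole/pihole.toml (Pi-hole v6)
[webserver]
  # was: port = "80r,443s"
  port = "8080r,8443s"
```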
I applied this change to pi2 and pi3, the two Pi-hole nodes, to keep their configuration consistent. After restarting pihole-FTL on both, Traefik was able to start successfully on pi3, completing the HA setup. The Pi-hole admin interface is now accessible at http://<pi-ip>:8080/admin or https://<pi-ip>:8443/admin, and Traefik uses the standard ports 80 and 443 on both nodes.
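For reference, the restart and a quick check that the standard ports are free again, run on each Pi-hole node:

```bash
sudo systemctl restart pihole-FTL

# Ports 80/443 should now be free (or owned by Traefik), with Pi-hole on 8080/8443
sudo ss -tlnp | grep -E ':(80|443|8080|8443)\s'

sudo systemctl restart traefik
```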
This demonstrates the importance of understanding service dependencies and planning for resource conflicts. When multiple services share infrastructure, port conflicts can break high availability setups. The solution required coordinating changes across multiple services, but it was necessary to make the HA configuration work correctly.
Configuring Keepalived
Keepalived configuration is straightforward once you understand VRRP. I configured both nodes with matching settings, except for priorities and state.
VRRP requires authentication between nodes. I generated a secure random password and configured it identically on both nodes. This prevents unauthorized nodes from joining the virtual router group.
The configuration includes the health check script. This script monitors Traefik’s systemd service status and reports to Keepalived. If Traefik fails, the script causes Keepalived to lower the node’s priority, triggering failover.
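The script is just a thin wrapper around systemd; a minimal sketch, with the path assumed:

```bash
#!/usr/bin/env bash
# /etc/keepalived/check_traefik.sh
# Exit 0 if Traefik is healthy; any non-zero exit counts as a failed check
# and Keepalived applies the configured weight penalty.
systemctl is-active --quiet traefik
```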
Advertisement interval is set to 1 second. This provides fast failure detection without creating excessive network traffic. For a homelab environment, this balance works well.
Virtual router ID 50 is unique on the network. Each VRRP group needs a unique router ID to avoid conflicts with other VRRP instances that might be running.
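Putting it together, the keepalived.conf on pi4 looks roughly like this; pi3's copy differs only in state BACKUP and priority 90. The interface name and password shown here are placeholders:

```
# /etc/keepalived/keepalived.conf on pi4 (primary)
vrrp_script check_traefik {
    script "/etc/keepalived/check_traefik.sh"
    interval 2
    fall 2
    rise 2
    weight -20          # 100 - 20 = 80, below pi3's 90, so a failed check forces failover
}

vrrp_instance VI_TRAEFIK {
    state MASTER
    interface eth0      # placeholder interface name
    virtual_router_id 50
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass CHANGEME    # shared secret, identical on both nodes
    }
    virtual_ipaddress {
        192.168.1.20/24
    }
    track_script {
        check_traefik
    }
}
```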
Testing failover
After installation, I tested several failover scenarios to verify the setup worked correctly.
Traefik service failure was the first test. Stopping Traefik on pi4 causes Keepalived to detect the failure and move the virtual IP to pi3 within seconds. Traffic continues flowing, and users don’t notice the switch. Restarting Traefik on pi4 causes the virtual IP to return after the health check confirms Traefik is running again.
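This is easy to watch from a third machine; a sketch of the checks I ran, with the interface name as a placeholder:

```bash
# Stop Traefik on the primary and watch the VIP move
ssh pi4 'sudo systemctl stop traefik'

# The VIP should leave pi4 and appear on pi3 within a few seconds
ssh pi4 "ip -4 addr show eth0 | grep 192.168.1.20"   # no output once it has moved
ssh pi3 "ip -4 addr show eth0 | grep 192.168.1.20"   # shows the VIP after failover

# Requests against the VIP are answered again within a couple of seconds
curl -sk -o /dev/null -w '%{http_code}\n' https://192.168.1.20/
```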
Node reboot was the second test. Rebooting pi4 causes immediate failover to pi3. When pi4 comes back online, the virtual IP returns after Traefik starts and the health check passes. The failover happens automatically without intervention.
Network disconnect was the third test. When pi4's network interface goes down, Keepalived on pi4 detects the link failure immediately, because it monitors the interface that carries the VRRP advertisements. It drops its priority and releases the virtual IP right away, which triggers failover much faster than waiting for the backup node to time out. Meanwhile, pi3 also detects the failure by missing advertisements: after three consecutive misses (about 3 seconds at a 1-second interval), it would promote itself to MASTER anyway, but because pi4 stepped down on its own, failover happens almost instantly. When pi4's connectivity is restored, it starts advertising again; pi3 sees that the higher-priority node is reachable, transitions back to BACKUP state, and releases the virtual IP to pi4.
Power outage was another scenario I tested. A complete power loss is different from a network disconnect because the primary node can't detect anything or take action: Keepalived on pi4 is simply gone, so it can't step down gracefully. In this case pi3 relies entirely on the advertisement timeout. After missing three consecutive advertisements (about 3 seconds at a 1-second interval), pi3 declares pi4 failed, promotes itself to MASTER, takes ownership of the virtual IP, and starts serving traffic. When power returns and pi4 boots back up, Keepalived starts advertising again; pi3 sees that the higher-priority node is reachable, transitions back to BACKUP, and releases the virtual IP to pi4 once the health checks pass.
All scenarios worked as expected, with failover times of 1 to 3 seconds. This is fast enough that users don't notice interruptions, which meets the requirement for high availability.
What I learned
Keepalived is simpler than it seems. The configuration is straightforward once you understand VRRP concepts. The main complexity is ensuring health checks work correctly and that both nodes have identical configurations. Good documentation helps, and there’s plenty available for Keepalived.
Configuration synchronization is critical for HA. Having a reliable way to sync configurations between nodes is essential. The git-based approach works well, with a sync script that validates configurations before deploying. This prevents configuration drift that could cause services to work on one node but not the other.
Health checks matter more than I initially thought. The Traefik health check in Keepalived ensures that if Traefik crashes, failover happens automatically. This prevents the virtual IP from staying on a node with a broken service, which would cause downtime even though the network layer thinks everything is fine.
Port conflicts need to be resolved upfront. When planning HA setups, it’s important to identify potential port conflicts early. Services sharing infrastructure need coordination, and sometimes you need to make changes to accommodate the HA architecture.
Documentation and scripts make operations smoother. Creating comprehensive procedures and runbooks made the installation smooth. Having scripts for common tasks like installation and config sync reduces errors and saves time. Infrastructure as code principles apply to operations too.
The current state
The Traefik HA installation is complete and operational. pi4 serves as the primary node with Traefik running on ports 80 and 443, and the virtual IP is active. pi3 serves as the backup node with Traefik ready for failover. Keepalived monitors Traefik health and manages virtual IP failover automatically.
The Pi-hole web interface is now on ports 8080 and 8443 on both pi2 and pi3. This resolved the port conflict and allows Traefik to use the standard ports on both nodes. DNS is configured and operational, pointing traefik.lan to the virtual IP at 192.168.1.20.
The setup provides the high availability I need without adding unnecessary complexity. Keepalived VIP gives me automatic failover with fast recovery times, and the active/passive model works well for homelab traffic levels. Sometimes the simpler solution is the right solution, and this setup proves that point.