The service was up but the site was down: a missing Traefik route

I went to open RomM, my self-hosted ROM manager, and got a 404. Not a timeout, not a connection refused, a clean “404 page not found.” The odd part: RomM itself was running. The container was up, the app was healthy, and I could reach it directly. Only the public URL was dead. This is the story of a service that was up while its site was down, and the small gap that caused it.

A 404 is a routing answer, not a crash

The first instinct on a dead URL is “the service crashed.” A 404 argues against that. A crashed or unreachable backend gives you a timeout, a connection refused, or a 502/503 from the proxy. A clean 404 means something answered and said “I have no idea what you are asking for.” On my setup the thing answering is Traefik, the reverse proxy that fronts every service. So the question was not “is RomM down,” it was “why does Traefik not know about RomM.”
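
That reasoning fits in a small lookup. The sketch below is a hypothetical helper, diagnose_status, not something from my actual setup. It takes the status code that curl prints with %{http_code} and names the layer I would look at first. Treat the mapping as a rule of thumb: an application can return its own 404 as well, so this is a starting point, not a verdict.

```shell
# Hypothetical helper: map an HTTP status code to the layer most likely at fault.
# $1 = status code as printed by curl's %{http_code} ("000" means no HTTP response).
diagnose_status() {
  case "$1" in
    000)         echo "no HTTP response: check DNS, the network, or whether the proxy is listening" ;;
    404)         echo "proxy answered but matched no router: check the routing config" ;;
    502|503|504) echo "proxy has a route but the backend is failing: check the service" ;;
    2??|3??)     echo "route and backend both look fine" ;;
    *)           echo "unexpected status: inspect proxy and service logs" ;;
  esac
}
```

Feeding it the code from a request to the public URL, for example diagnose_status "$(curl -sS -o /dev/null -w '%{http_code}' https://romm.example.net)", gives a first guess at where to look.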

Ruling out the service and DNS

I checked the two easy things first.

The service: hitting the container directly, bypassing the proxy, returned a normal response.

curl -sS -o /dev/null -w '%{http_code}\n' http://<romm-container>:7676
# 200

RomM was fine. Then DNS:

dig +short romm.example.net
# (the Traefik VIP)

DNS resolved to the Traefik load balancer, which is correct. So the request was leaving my machine, resolving properly, and arriving at Traefik. The break was inside Traefik: it had received the request and had no router matching Host(`romm.example.net`).
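
One way to confirm that from Traefik's side is to ask its API which routers it has loaded. This is a sketch under assumptions: it presumes the Traefik API is enabled and reachable, the host and port in the usage comment are placeholders, and router_exists_for_host is a hypothetical helper that does nothing more than search the JSON for the rule string.

```shell
# Hypothetical helper: read Traefik's router list (JSON) on stdin and succeed
# only if some router rule contains Host(`<hostname>`).
# $1 = hostname to look for.
router_exists_for_host() {
  grep -qF "Host(\`$1\`)"
}

# Usage sketch (assumes the Traefik API is enabled; host and port are placeholders):
#   curl -sS http://<traefik-host>:8080/api/http/routers \
#     | router_exists_for_host romm.example.net \
#     && echo "router present" || echo "no router for that host"
```

A plain substring match is crude, but it is enough to answer the one question that matters here: does the proxy know this hostname at all.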

The router that was not there

Traefik discovers routes two ways on my setup, and the relevant one here is its file provider, which watches a directory of dynamic configuration:

providers:
  file:
    directory: '/etc/traefik/dynamic'
    watch: true

Each service gets a small YAML file in that directory describing its router, its backend, and any middleware. RomM’s looks like this:

http:
  routers:
    romm:
      rule: 'Host(`romm.example.net`)'
      service: romm
      entryPoints:
        - websecure
      tls:
        certResolver: cloudflare
  services:
    romm:
      loadBalancer:
        servers:
          - url: 'http://<romm-container>:7676'

That file existed in my repo. It did not exist on the server. Running ls /etc/traefik/dynamic on the proxy host showed a file for every other service and nothing for RomM. With no file, the file provider had no router, and a request for romm.example.net fell through to Traefik’s default 404. The service was healthy the entire time; the only thing missing was the 14-line config that tells the proxy the service exists.
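
The same comparison can be scripted so it does not depend on eyeballing a directory listing. This is a minimal sketch, assuming both directories are readable as local paths (for example a checkout of the repo and a copy or mount of the server directory). The function name missing_routes is made up for illustration.

```shell
# Hypothetical helper: list dynamic config files that exist in the repo
# directory ($1) but are absent from the deployed directory ($2).
missing_routes() {
  for f in "$1"/*.yml; do
    [ -e "$f" ] || continue            # the glob matched nothing
    name=$(basename "$f")
    [ -e "$2/$name" ] || echo "$name"  # in the repo, not on the server
  done
}
```

Any filename it prints is a route that is committed but not deployed. Empty output means the two directories agree on which files exist, though it says nothing about whether their contents match.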

Committed is not deployed

Here is the gap. Deploying RomM and deploying RomM’s route are two separate Ansible plays. The role that configures Traefik finds every dynamic config in the repo and copies it to the server:

- name: Get list of dynamic configuration files
  find:
    paths: '{{ service_config_dir }}/dynamic'
    patterns: '*.yml'
  register: dynamic_config_files
  delegate_to: localhost

- name: Deploy dynamic configuration files
  copy:
    src: '{{ item.path }}'
    dest: '{{ traefik_dynamic_config_dir }}/{{ item.path | basename }}'
  loop: '{{ dynamic_config_files.files }}'

I had added romm.yml to the repo when I set the service up, but I never ran the Traefik play afterward. The RomM deploy succeeded, the commit was clean, and the routing config sat in version control doing nothing, because nothing had copied it to the box. Running the Traefik playbook fixed it in one shot. Ansible copied the file across, and the route registered itself: the file provider has watch: true, so a dropped-in config is picked up live, and romm.example.net went from 404 to 200 the moment the file landed. The play also bounces Traefik through a restart handler at the end, but by then the route was already serving, so the restart was redundant.
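
A cheap guard against forgetting the second play is to never run the first one on its own. The wrapper below is a sketch, not my actual tooling: deploy_with_route and the playbook paths in the usage line are hypothetical, and the ANSIBLE_PLAYBOOK override exists only so the function can be exercised without touching a real host.

```shell
# Hypothetical wrapper: run the service play, then the proxy play, and stop
# at the first failure so a half-finished deploy is loud rather than silent.
# $1 = service playbook, $2 = Traefik playbook (both paths are placeholders).
# ANSIBLE_PLAYBOOK can override the command that gets run.
deploy_with_route() {
  runner=${ANSIBLE_PLAYBOOK:-ansible-playbook}
  "$runner" "$1" || { echo "service play failed: $1" >&2; return 1; }
  "$runner" "$2" || { echo "proxy play failed: $2" >&2; return 1; }
  echo "deployed $1 and $2"
}
```

Called as deploy_with_route playbooks/romm.yml playbooks/traefik.yml, it makes the app and its route ship together or not at all.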

Lessons

A clean 404 on a known, healthy service is usually a routing problem, not a service problem. Check the proxy’s view of the world before you go restarting containers. The backend being healthy and the URL being dead at the same time points straight at the layer in between.
“It is in the repo” and “it is on the server” are different claims. Infrastructure-as-code only helps if you actually run the apply step. A clean git history is not a deployed state.
Watch out for multi-step deploys where the app and its proxy config ship separately. The failure mode is silent: the service comes up green, its health check passes, and the only thing broken is the route, which nothing monitors as carefully as the service itself.
An end-to-end check on the public URL belongs in the deploy itself. A container health check alone would have stayed green through this entire outage, because the app it checks was never the problem.
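
The end-to-end check from the last lesson can be very small. This is a sketch under assumptions: check_public_url is a hypothetical name, and treating any 2xx or 3xx as success is my guess at a reasonable default, since an app that redirects to a login page is still routed correctly.

```shell
# Hypothetical helper: succeed only if the public URL ($1) answers with a
# 2xx or 3xx status. A curl failure is recorded as "000" (no HTTP response).
check_public_url() {
  status=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$1") || status=000
  case "$status" in
    2??|3??) echo "ok: $1 returned $status" ;;
    *)       echo "FAIL: $1 returned $status" >&2; return 1 ;;
  esac
}
```

Run as the last step of a deploy, for example check_public_url https://romm.example.net, it fails on exactly the condition a container health check cannot see: the proxy answering 404 for a service that is up.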

Victor Da Luz
