Deploying paperless-ngx, self-hosted document management with OCR

By Victor Da Luz
homelab paperless-ngx proxmox docker self-hosted postgresql ocr

Paper accumulates faster than I deal with it. Bank letters, appliance manuals, tax forms, the warranty card for a drill I will need in three years. My filing system was a drawer, and “find the receipt” meant emptying the drawer onto the floor. I wanted to scan a document with my phone, have it OCR’d and searchable, and never touch the paper again. Self-hosted, because I am not uploading my tax returns to someone else’s server.

Picking a document manager

I looked at a handful of self-hosted options: paperless-ngx, the older paperless-ng it was forked from, Docspell, and a couple of newer projects. paperless-ngx is the actively maintained community fork, and it does exactly what I needed: OCR on ingest, tags and correspondents, full-text search, a REST API, and a Docker Compose deployment. The alternatives were either less maintained (paperless-ng) or aimed at a different problem (document signing, team workflows) than “scan my mail and find it later.” paperless-ngx fit.

The stack: Docker-in-LXC

My homelab runs each service as Docker containers inside a Proxmox LXC container, which wraps a normal Compose stack in Proxmox-level backups, replication, and HA failover. Unlike a single-binary app, paperless-ngx is a three-container stack: the webserver, a PostgreSQL database, and a Redis broker for its task queue. The compose file pins postgres:15-alpine and redis:7-alpine, with the webserver from ghcr.io/paperless-ngx/paperless-ngx:latest. Traefik fronts it for HTTPS and Ansible drives the deploy.
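
As a rough sketch, the three services might be laid out in Compose like this. The service names webserver, db, and broker match the snippets in this post; the PAPERLESS_REDIS and PAPERLESS_DBHOST values are assumptions that follow from those names, and the restart policy is illustrative rather than taken from the actual file.

```yaml
# Minimal sketch of the three-service layout; volumes, healthchecks,
# and credentials are omitted for brevity.
services:
  broker:
    image: redis:7-alpine
    restart: unless-stopped

  db:
    image: postgres:15-alpine
    restart: unless-stopped

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    environment:
      # Hostnames are the Compose service names above (assumed).
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
```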

The webserver waits for both dependencies to be healthy before it starts:

depends_on:
  db:
    condition: service_healthy
  broker:
    condition: service_healthy
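
condition: service_healthy only does something useful if db and broker each define a healthcheck, since a service without one can never report healthy. A minimal sketch, assuming a database user named paperless (a placeholder) and using the pg_isready and redis-cli tools that ship in the two images; the interval, timeout, and retry values are illustrative:

```yaml
db:
  healthcheck:
    # pg_isready exits 0 once the server accepts connections.
    test: ["CMD-SHELL", "pg_isready -U paperless"]
    interval: 10s
    timeout: 5s
    retries: 5

broker:
  healthcheck:
    # redis-cli ping exits 0 when the server answers.
    test: ["CMD", "redis-cli", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5
```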

That ordering matters for the bug I hit next.

The PostgreSQL permission bug that blocked the whole stack

First deploy, the webserver never came up. docker compose ps showed the database container restarting in a loop and the webserver stuck waiting on it. The Postgres logs were the tell: it could not initialize its data directory.

The cause is the Docker-in-LXC pattern. The database keeps its data on a bind-mounted host path:

db:
  image: postgres:15-alpine
  volumes:
    - ./data/postgres:/var/lib/postgresql/data
  environment:
    PGDATA: /var/lib/postgresql/data/pgdata

When Docker first creates that host directory, it is owned by root. The postgres:15-alpine image runs as uid 999, and uid 999 cannot write to a root-owned directory, so initdb fails, the healthcheck never passes, and because the webserver is gated on condition: service_healthy, the whole stack sits there waiting on a database that will never come up.

The fix is to create the host directory with the right ownership before Compose starts. I baked it into the Ansible role:

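# Pre-create the bind-mount target so Docker does not create it as root.
- name: Create PostgreSQL data directory with correct permissions
  ansible.builtin.file:
    path: '{{ service_dir }}/data/postgres'
    state: directory
    owner: '999'  # must match the uid the database container runs as
    group: '999'
    mode: '0755'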

Redis needed its own ownership on its data directory too. Once the host directories matched the uids the containers run as, Postgres initialized, the healthcheck went green, and the webserver started. Every deploy of the role now gets it right the first time.
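
The uid a container writes as is worth confirming rather than assuming, because it can differ between image tags and between Debian-based and Alpine-based variants of the same image. A minimal sketch of such a check, written as Ansible tasks to match the role above; the task names and the postgres_id and redis_id variables are hypothetical, and it assumes Docker is already installed on the target:

```yaml
# Ask each image which uid/gid its service user has.
- name: Look up the postgres user's uid and gid inside the image
  ansible.builtin.command:
    cmd: docker run --rm postgres:15-alpine id postgres
  register: postgres_id
  changed_when: false  # read-only check, never reports a change

- name: Look up the redis user's uid and gid inside the image
  ansible.builtin.command:
    cmd: docker run --rm redis:7-alpine id redis
  register: redis_id
  changed_when: false

- name: Print both so the owner/group values can be compared
  ansible.builtin.debug:
    msg:
      - "{{ postgres_id.stdout }}"
      - "{{ redis_id.stdout }}"
```

Whatever those commands report is what the owner and group on the matching host directory should be set to.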

OCR and getting documents in

paperless-ngx OCRs documents on ingest. I set three things in the environment: language eng, output PDF/A, and OCR mode skip, so a PDF that already carries a text layer is not re-processed. The admin user is created on first run from environment variables, so the instance is usable the moment it comes up.
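
Those settings map onto paperless-ngx environment variables roughly as follows. This is a sketch rather than the exact file: the admin username is a placeholder, and the password is assumed to be interpolated by Compose from an .env file or the shell environment.

```yaml
webserver:
  environment:
    PAPERLESS_OCR_LANGUAGE: eng      # OCR language
    PAPERLESS_OCR_OUTPUT_TYPE: pdfa  # archive as PDF/A
    PAPERLESS_OCR_MODE: skip         # leave existing text layers alone
    # Superuser created on first start (placeholder values).
    PAPERLESS_ADMIN_USER: admin
    PAPERLESS_ADMIN_PASSWORD: ${PAPERLESS_ADMIN_PASSWORD}
```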

Documents get in two ways: a watched consume folder (drop a file, paperless picks it up, OCRs it, files it) and the REST API, which is what the phone apps talk to. On iOS there are a few options, including Swift Paperless, that scan with the camera and push straight to the API. Scan a letter, it lands searchable a few seconds later.
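
The watched folder is just another bind mount on the webserver. A minimal sketch, assuming a ./consume directory next to the compose file (the host path is a placeholder); the container path is the consumption directory the official image uses by default:

```yaml
webserver:
  volumes:
    # Files dropped into ./consume on the host are picked up and ingested.
    - ./consume:/usr/src/paperless/consume
```

Like the database directory, this host path needs to be writable by whatever uid the webserver runs as.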

Backups and where it landed

Because everything sits on the LXC’s disk, the documents are covered by the same Proxmox-level backups and replication as every other service: a snapshot to Proxmox Backup Server and replication to a second node for failover. On top of that, paperless has its own document_exporter, a built-in command that writes a portable archive of every document and its metadata, so I am not solely dependent on container snapshots. The export is tied to the paperless-ngx version that produced it, so a restore should go into the same version.
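
Running the exporter can be folded into the same Ansible role. A minimal sketch, assuming the Compose service is named webserver and that a host directory is mounted at the container's export path; the task name is hypothetical, and service_dir is the same variable used for the data directory above:

```yaml
- name: Export all documents and metadata with document_exporter
  ansible.builtin.command:
    # -T disables TTY allocation, which a non-interactive run needs.
    cmd: docker compose exec -T webserver document_exporter ../export
    chdir: "{{ service_dir }}"
  changed_when: true  # every run rewrites or updates the export
```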

It slotted into the homelab like everything else: behind Traefik over HTTPS, on the Homepage dashboard, monitored by Uptime Kuma. The drawer is empty now.

A few takeaways:

  • Docker-in-LXC bind mounts start out owned by root. Any image that runs as a non-root uid (Postgres as 999, for one) needs its host data directory pre-created with matching ownership, or it fails to initialize and only its own logs say why.
  • depends_on: condition: service_healthy turns one broken container into a stuck stack. It is the right call for correctness, but it means a database that will not start looks like a webserver that will not start. Read the logs of the dependency, not the thing waiting on it.
  • A three-service stack (app, database, broker) is more to get wrong than a single SQLite file, but the database is the only hard part, and it is hard exactly once.
  • Match the OCR mode to your inputs. skip avoids re-OCRing PDFs that already have a text layer.
