Building data integrity into homelab documentation

By Victor Da Luz

When I started rebuilding my homelab documentation system, I knew I wanted something more robust than hoping configuration files stayed correct. After years of dealing with broken YAML, inconsistent data formats, and problems that only surfaced months later, I decided to build data integrity into the foundation of my new homelab-v2 project.

The problem was configuration drift. Over time, YAML files would accumulate inconsistent field names, missing required fields, and invalid values that broke automation scripts. Sometimes these issues would cause silent failures that I wouldn’t discover until much later, when fixing them was harder.

I needed a way to catch these issues before they became problems. That’s how I ended up with JSON Schema validation built into my documentation workflow.

The problem with configuration drift

My original homelab documentation suffered from a common problem. Over time, configuration files would drift from their intended structure. Field names would become inconsistent, like using rack_position in one place and rackPosition in another. Required fields would go missing. Invalid values would break automation scripts.

These problems were hard to catch manually. By the time I noticed something was wrong, the issue might have been in the repository for weeks or months. Fixing it meant tracing back through commit history to figure out when it broke and why.

I wanted something that would catch these problems immediately. Not after I tried to run a script. Not after I tried to deploy something. Right when I made the change, before it even got committed to the repository.

The solution was JSON Schema validation. YAML parses into the same basic data structures as JSON (mappings, sequences, and scalars), so JSON Schema works well for validating YAML data once it's loaded. I could define strict schemas for my configuration files and validate them automatically. This would catch problems at the source, before they could cause issues downstream.
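
To make that concrete, here's a minimal sketch of the mechanism. The schema and file name are throwaway examples rather than my real ones: pyyaml parses the YAML into ordinary Python dicts and lists, and jsonschema validates that structure directly.

```python
import yaml
from jsonschema import validate, ValidationError

# Throwaway example schema; the real schemas are far more detailed.
schema = {
    "type": "object",
    "required": ["hostname"],
    "properties": {"hostname": {"type": "string"}},
}

# yaml.safe_load returns plain dicts/lists, the same data model JSON uses,
# so jsonschema can validate the result directly.
with open("state/example-device.yaml") as f:
    data = yaml.safe_load(f)

try:
    validate(instance=data, schema=schema)
    print("OK")
except ValidationError as err:
    print(f"Validation failed: {err.message}")
```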

Setting up JSON Schema validation

I started with Python and installed the essential packages. Using Python 3.14.0 managed via asdf, I installed pyyaml for YAML parsing and jsonschema for validation. These are straightforward dependencies that work well together.

Then I created a validation script that would load YAML state files, validate them against JSON schemas, provide clear error messages, and exit with proper status codes. The script runs through all state files in the repository and checks them against their corresponding schemas. If any file fails validation, the script exits with an error code, preventing the operation from completing.
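
I won't reproduce the real script here, but a minimal sketch of its shape, assuming a layout where each state/<name>.yaml is paired with a schemas/<name>.schema.json, might look like this:

```python
#!/usr/bin/env python3
"""Validate YAML state files against their JSON Schemas (illustrative sketch)."""
import json
import sys
from pathlib import Path

import yaml
from jsonschema import Draft7Validator

# Assumed layout: state/devices.yaml is checked against schemas/devices.schema.json, etc.
STATE_DIR = Path("state")
SCHEMA_DIR = Path("schemas")


def validate_file(state_path: Path) -> list[str]:
    schema = json.loads((SCHEMA_DIR / f"{state_path.stem}.schema.json").read_text())
    data = yaml.safe_load(state_path.read_text())
    # iter_errors reports every problem in the file, not just the first one.
    return [
        f"{state_path}: {'/'.join(map(str, e.absolute_path)) or '<root>'}: {e.message}"
        for e in Draft7Validator(schema).iter_errors(data)
    ]


def main() -> int:
    errors: list[str] = []
    for state_path in sorted(STATE_DIR.glob("**/*.yaml")):
        errors.extend(validate_file(state_path))
    for line in errors:
        print(line, file=sys.stderr)
    return 1 if errors else 0


if __name__ == "__main__":
    sys.exit(main())
```

The non-zero exit code is the important part: it's what lets a Git hook or a CI job refuse the change.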

The key insight was progressive validation. I didn’t try to define perfect schemas upfront. Instead, I started simple with basic required fields and types, then validated existing data to find real-world issues. This iterative process helped me discover problems I hadn’t considered initially.
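
As an illustration of what "starting simple" means (a sketch, not my actual schema), a first pass might pin down only required fields and basic types, leaving enums and patterns for later iterations like the refinements described below:

```python
# First-iteration schema sketch: required fields and basic types only.
# Enums, IP patterns, and other constraints came later, once real data
# had been validated against it.
device_schema_v1 = {
    "type": "object",
    "required": ["hostname", "device_type", "ip_address", "rack_position"],
    "properties": {
        "hostname": {"type": "string"},
        "device_type": {"type": "string"},
        "ip_address": {"type": "string"},
        "rack_position": {"type": "string"},
        "notes": {"type": "string"},
    },
    "additionalProperties": False,
}
```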

Each schema refinement taught me something about my data. I’d define a schema, validate it against real data, find issues, fix the schema, and repeat. Over time, the schemas became more robust and better matched how the data was used.

What the validation process revealed

The validation process uncovered several design issues I hadn’t anticipated. These discoveries improved the schemas and made them more robust for real-world use.

Device types needed expansion. My initial enum was too restrictive. I had defined device types like server, switch, and router, but I needed to add isp_equipment for ISP-provided devices that I wanted to document but didn’t control.

IP address patterns were more flexible than I thought. I initially restricted everything to 10.0.x.x because that’s what my internal network uses. But WAN connections use different ranges like 172.27.x.x. The schema needed to accommodate these different IP address patterns.

VLAN assignments aren’t universal. Not all devices are on VLANs. Some devices connect directly to untagged networks. I needed to allow untagged as a valid VLAN assignment value, not just numeric VLAN IDs.

MAC addresses sometimes aren’t known. Some devices have unknown MAC addresses, especially if they’re ISP equipment or devices I haven’t had physical access to. I needed to allow "Unknown" as a valid MAC address value instead of requiring a specific format.
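
Put together, the refined property definitions ended up looking something like this sketch; the field names, bounds, and exact patterns here are illustrative rather than copied from my schemas:

```python
# Refined property sketches reflecting the discoveries above (illustrative).
refined_properties = {
    # Enum expanded to cover ISP-provided gear I document but don't control.
    "device_type": {
        "type": "string",
        "enum": ["server", "switch", "router", "isp_equipment"],
    },
    # Internal 10.0.x.x addresses plus the 172.27.x.x range the WAN side uses.
    "ip_address": {
        "type": "string",
        "pattern": r"^(10\.0\.\d{1,3}\.\d{1,3}|172\.27\.\d{1,3}\.\d{1,3})$",
    },
    # Numeric VLAN IDs, or the literal string "untagged" for untagged networks.
    "vlan": {
        "anyOf": [
            {"type": "integer", "minimum": 1, "maximum": 4094},
            {"type": "string", "const": "untagged"},
        ]
    },
    # A properly formatted MAC address, or "Unknown" for devices I can't inspect.
    "mac_address": {
        "anyOf": [
            {"type": "string", "pattern": r"^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$"},
            {"type": "string", "const": "Unknown"},
        ]
    },
}
```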

Each of these discoveries improved the schema. They made it more accurate to real-world conditions and less likely to reject valid data. The validation process wasn’t just catching errors; it was helping me understand the data better.

Automating validation with pre-commit hooks

Manual validation is better than nothing, but it’s easy to forget. I wanted automatic validation on every commit, so problems would be caught immediately.

I started with a basic Git pre-commit hook that runs the validation script. The hook runs before every commit, and if validation fails, the commit is blocked. This ensures that only valid data enters the repository.
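
The hook itself is tiny. Mine isn't reproduced here, but a sketch of the idea, written as a Python hook and assuming the validator lives at scripts/validate_state.py, looks like this:

```python
#!/usr/bin/env python3
# .git/hooks/pre-commit (sketch): run the validator and block the commit on failure.
# Assumes the validation script lives at scripts/validate_state.py.
import subprocess
import sys

result = subprocess.run([sys.executable, "scripts/validate_state.py"])
# A non-zero exit code here is what makes Git abort the commit.
sys.exit(result.returncode)
```

The file has to be executable for Git to run it.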

The initial hook worked, but it wasn’t portable. Other developers, or my future self on a new machine, wouldn’t get the hook automatically when cloning the repository. Git hooks aren’t version controlled, so they need to be set up separately.

I created two approaches to make validation portable. First, a setup script that others can run after cloning to install the hooks. Second, a pre-commit framework configuration file that works with the standard pre-commit install workflow. Both approaches ensure that schema validation happens automatically, regardless of who’s working on the project or when they set it up.
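
The setup script amounts to copying the versioned hook into .git/hooks and marking it executable. A sketch, assuming the hook source is kept at hooks/pre-commit and the script lives in scripts/:

```python
#!/usr/bin/env python3
# scripts/install_hooks.py (sketch): install the versioned pre-commit hook
# after cloning. Assumes the hook source lives at hooks/pre-commit.
import shutil
import stat
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent
SOURCE = REPO_ROOT / "hooks" / "pre-commit"
TARGET = REPO_ROOT / ".git" / "hooks" / "pre-commit"

shutil.copyfile(SOURCE, TARGET)
# Git only runs hooks that are executable.
TARGET.chmod(TARGET.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
print(f"Installed {SOURCE.name} -> {TARGET}")
```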

The pre-commit framework approach is particularly nice because it’s a standard tool that many developers already know. Running pre-commit install sets everything up, and the framework handles running the validation script at the right time.

The results

After implementing schema validation, the documentation system became more reliable. Every commit is validated before it enters git history, which means the repository stays consistent. All state files follow the same structure, which makes automation scripts more reliable.

The schemas serve as living documentation of data formats. Instead of hoping documentation stays up to date with the code, the schemas define what valid data looks like. Future tools can rely on valid, consistent data because validation ensures it.

The validation catches problems immediately. When I make a change that breaks the schema, I know right away. There’s no waiting until deployment or until some script runs to discover the problem. It’s caught at commit time, when fixing it is easiest.

This creates confidence in the documentation system. I know that the data is structured correctly, that fields have valid values, and that automation scripts can rely on consistent formats. Small issues don’t accumulate into big problems because they’re caught early.

The iterative process

Building data integrity isn’t a one-time effort. It’s an iterative process that improves over time.

You start with simple schemas and validate real data. The real data will reveal design flaws you didn’t anticipate. Field names might be inconsistent, values might need to be more flexible than you thought, and required fields might not always be necessary.

Each iteration makes the schemas better. You fix issues as you discover them, refine constraints, and document the decisions. Over time, the schemas become more robust and better reflect real-world conditions.

The key is starting early. Building validation into the foundation means you catch problems from the beginning. Adding validation later means dealing with accumulated technical debt and fixing existing issues while trying to prevent new ones.

Building on a solid foundation

With solid data integrity in place, I can focus on the more interesting parts of the homelab: migrating devices, building automation, and expanding the infrastructure. The validation system catches issues before they become problems, which means less time debugging and more time building.

The foundation is solid. The schemas validate data structure. Pre-commit hooks catch problems automatically. The documentation system is reliable and consistent. Now I can build on top of it with confidence, knowing that the data layer is handled.

Building data integrity into the foundation changes how you work. Instead of hoping configuration files are correct, you know they are. Instead of debugging broken scripts that fail because of invalid data, you catch data problems before scripts run. It’s a small change in approach that makes a big difference in reliability.
