# NetBird Reconciler: Design Document

> **Status:** Approved
> **Author:** @prox
> **Date:** 2026-03-03
> **Proposal:** NetBird GitOps Proposal (rev2)

## Overview

A dedicated backend service that provides declarative, GitOps-driven reconciliation for NetBird VPN configuration. Engineers declare desired state in `netbird.json`; the reconciler computes a diff against actual state and applies changes with all-or-nothing validation: nothing is mutated unless the whole plan validates, and apply aborts on the first failure (see Error Handling & Rollback).

- **Repo:** `BlastPilot/netbird-gitops` (service code + state file in one repo)
- **Runtime:** TypeScript / Deno
- **Deployment:** Docker Compose on the NetBird VPS, behind Traefik

## Architecture

The reconciler has two responsibilities:

1. **Reconciliation API**: called by Gitea Actions CI on PR events. Accepts desired state (`netbird.json`), fetches actual state from the NetBird API, computes a diff, and either returns a plan (dry-run) or applies changes.
2. **Event Poller**: background loop polling NetBird `/api/events` every 30s to detect peer enrollments. When a peer enrolls via a known setup key, the poller renames it, assigns it to the correct group, and commits `enrolled: true` back to git via the Gitea API.

### Data Flow

```
Engineer -> PR to netbird-gitops (edit netbird.json)
  -> CI: dry-run -> reconciler -> plan posted as PR comment
  -> PR merged -> CI: apply -> reconciler -> mutations to NetBird API
                                          -> response with created_keys
                                          -> CI: encrypt keys with age, upload artifact

Event poller (background):
  -> polls NetBird /api/events
  -> detects peer enrollment (peer.setupkey.add)
  -> renames peer, assigns groups
  -> commits enrolled:true via Gitea API
```

### Integration with Enrollment Pipeline

The existing enrollment pipeline in `blastpilot-public` changes as follows:

- **Before:** `handleApproval()` creates `peers/enrollment-{N}.json`; `handlePRMerge()` calls the NetBird API directly to create setup keys and emails a PDF.
- **After:** `handleApproval()` modifies `netbird.json` (adds setup key and group entries) and creates a PR. Key creation is handled by the reconciler on merge.
Key delivery starts as manual (an engineer downloads the encrypted artifact), with automation added later.

## State File Format

`netbird.json` lives at the repo root. All resources are referenced by name, never by NetBird ID.

```json
{
  "groups": {
    "pilots": { "peers": ["Pilot-hawk-72"] },
    "ground-stations": { "peers": ["GS-hawk-72"] },
    "commanders": { "peers": [] }
  },
  "setup_keys": {
    "GS-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["ground-stations"],
      "enrolled": true
    },
    "Pilot-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["pilots"],
      "enrolled": false
    }
  },
  "policies": {
    "pilots-to-gs": {
      "description": "Allow pilots to reach ground stations",
      "enabled": true,
      "sources": ["pilots"],
      "destinations": ["ground-stations"],
      "bidirectional": true,
      "protocol": "ALL"
    }
  },
  "routes": {
    "gs-local-network": {
      "description": "Route to GS local subnet",
      "network": "192.168.1.0/24",
      "peer_groups": ["ground-stations"],
      "enabled": true
    }
  },
  "dns": {
    "nameserver_groups": {}
  }
}
```

**Conventions:**

- Setup key name = expected peer hostname
- `enrolled: false`: the setup key should exist, but the peer hasn't connected yet
- `enrolled: true`: the peer has been detected, renamed, and assigned to groups
- Groups reference peers by setup key name (which becomes the peer hostname after rename)
- Policies reference groups by name
- The reconciler maintains an internal name-to-ID mapping, fetched at plan time

## API Endpoints

All endpoints except `GET /health` are authenticated via an `Authorization: Bearer` header carrying `RECONCILER_TOKEN`.

### `POST /reconcile`

**Query params:** `dry_run=true|false` (default: false)

**Request body:** Contents of `netbird.json`

Behavior:

1. Fetch actual state from the NetBird API (groups, setup keys, peers, policies, routes, DNS)
2. Process pending enrollments from event poller state
3. Compute the diff between desired and actual state
4. If `dry_run=true`: return the plan without applying it
5. If `dry_run=false`: execute in dependency order (groups, setup keys, peers, policies, routes). Abort on first failure.

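
Steps 3 to 5 of the behavior above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the types, `computePlan`, `runPlan`, and the `execute` callback are hypothetical names, and the structural comparison via `JSON.stringify` is a simplifying assumption.

```typescript
// Hypothetical model: the document does not specify the reconciler's internal types.
type ResourceKind = "group" | "setup_key" | "peer" | "policy" | "route";
type Action = "create" | "update" | "delete";

interface Operation {
  kind: ResourceKind;
  action: Action;
  name: string;
}

interface OperationResult extends Operation {
  status?: "success" | "failed"; // unset for dry-run plans
  error?: string;
}

interface ReconcileResult {
  status: "planned" | "applied" | "error";
  operations: OperationResult[];
}

// Resources keyed by name, as in netbird.json.
type ResourceMap = Record<string, unknown>;
type State = Partial<Record<ResourceKind, ResourceMap>>;

// Dependency order from the Behavior list.
const APPLY_ORDER: ResourceKind[] = ["group", "setup_key", "peer", "policy", "route"];

// Step 3: diff desired against actual, one resource kind at a time.
// Assumption: JSON.stringify equality is good enough here (it is key-order sensitive).
// Deletes are emitted in the same kind order for brevity; a real implementation
// would likely need to remove dependents before the resources they reference.
function computePlan(desired: State, actual: State): Operation[] {
  const plan: Operation[] = [];
  for (const kind of APPLY_ORDER) {
    const want: ResourceMap = desired[kind] ?? {};
    const have: ResourceMap = actual[kind] ?? {};
    for (const name of Object.keys(want)) {
      if (!(name in have)) {
        plan.push({ kind, action: "create", name });
      } else if (JSON.stringify(want[name]) !== JSON.stringify(have[name])) {
        plan.push({ kind, action: "update", name });
      }
    }
    for (const name of Object.keys(have)) {
      if (!(name in want)) plan.push({ kind, action: "delete", name });
    }
  }
  return plan;
}

// Steps 4 and 5: return the plan untouched on dry run; otherwise apply each
// operation in order and stop at the first failure, reporting partial results.
async function runPlan(
  plan: Operation[],
  dryRun: boolean,
  execute: (op: Operation) => Promise<void>,
): Promise<ReconcileResult> {
  if (dryRun) return { status: "planned", operations: plan };
  const operations: OperationResult[] = [];
  for (const op of plan) {
    try {
      await execute(op);
      operations.push({ ...op, status: "success" });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      operations.push({ ...op, status: "failed", error: message });
      return { status: "error", operations };
    }
  }
  return { status: "applied", operations };
}
```

Keeping plan computation pure and passing the NetBird calls in through `execute` means the same plan can serve both the dry-run comment and the apply step, and the ordering logic can be tested without a live NetBird instance.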

Response:

```json
{
  "status": "applied | planned | error",
  "operations": [
    { "type": "create_group", "name": "pilots", "status": "success" },
    { "type": "create_setup_key", "name": "Pilot-hawk-72", "status": "success" },
    { "type": "create_policy", "name": "pilots-to-gs", "status": "failed", "error": "..." }
  ],
  "created_keys": {
    "Pilot-hawk-72": "XXXXXX-XXXXXX-XXXXXX"
  },
  "summary": { "created": 2, "updated": 0, "deleted": 0, "failed": 1 }
}
```

`created_keys` contains only the keys created in this run. CI uses it to build the encrypted artifacts.

### `POST /sync-events`

Forces the event poller to process pending events immediately. Returns detected enrollments.

```json
{
  "enrollments": [
    {
      "setup_key_name": "GS-hawk-72",
      "peer_id": "abc123",
      "renamed": true,
      "groups_assigned": true
    }
  ]
}
```

### `GET /health`

No auth. Returns service status for the Docker healthcheck.

## Event Poller

**Mechanism:**

- Polls `GET /api/events` every 30 seconds (configurable via `POLL_INTERVAL_SECONDS`)
- Persists `last_event_timestamp` to `/data/poller-state.json` (Docker volume)
- Loads the last-known `netbird.json` desired state on startup and after each reconcile

**Enrollment detection:**

1. Filter events for `peer.setupkey.add` activity
2. Extract `setup_key_name` from event metadata
3. Look up the key in desired state. If found and `enrolled: false`:
   - Rename the peer to match the setup key name via `PUT /api/peers/{id}`
   - Assign the peer to the groups in `setup_keys[name].auto_groups`
   - Commit `enrolled: true` to git via the Gitea API (optimistic concurrency with SHA check)
   - Commit message: `chore: mark {key_name} as enrolled [automated]`
4. If not found: log a warning (unknown peer enrolled outside GitOps)

**Edge cases:**

- Race with reconcile: if a reconcile is in progress, enrollment processing queues until it completes
- Duplicate events: handled idempotently; skip if the peer is already renamed and enrolled
- Unknown peers: logged but not touched

## CI Workflows

### `dry-run.yml`: on PR open/update

```yaml
on:
  pull_request:
    paths: ['netbird.json']
```

Steps:

1. Checkout PR branch
2. `POST /reconcile?dry_run=true` with `netbird.json`
3. Format response as markdown table
4. Post/update PR comment via Gitea API

### `reconcile.yml`: on push to main

```yaml
on:
  push:
    branches: [main]
    paths: ['netbird.json']
```

Steps:

1. Checkout repo
2. `POST /sync-events` to process pending enrollments
3. `POST /reconcile` with `netbird.json`
4. If `created_keys` is non-empty: encrypt with `age`, upload as Gitea Actions artifact
5. Pull latest (the poller may have committed)
6. On failure: the job fails and an engineer investigates

### Gitea Secrets

| Secret | Purpose |
|--------|---------|
| `RECONCILER_URL` | Reconciler service URL |
| `RECONCILER_TOKEN` | Bearer token for CI auth |
| `AGE_PUBLIC_KEY` | Encrypts setup key artifacts |
| `GITEA_TOKEN` | PR comment posting (achilles-ci-bot) |

## Deployment

Docker Compose on the NetBird VPS:

```yaml
services:
  netbird-reconciler:
    image: gitea.internal/blastpilot/netbird-reconciler:latest
    restart: unless-stopped
    environment:
      NETBIRD_API_URL: "https://netbird.example.com/api"
      NETBIRD_API_TOKEN: "${NETBIRD_API_TOKEN}"
      GITEA_URL: "https://gitea.example.com"
      GITEA_TOKEN: "${GITEA_TOKEN}"
      GITEA_REPO: "BlastPilot/netbird-gitops"
      RECONCILER_TOKEN: "${RECONCILER_TOKEN}"
      POLL_INTERVAL_SECONDS: "30"
      PORT: "8080"
    volumes:
      - reconciler-data:/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.reconciler.rule=Host(`reconciler.internal`)"

# Named volumes must be declared at the top level for Compose to accept them.
volumes:
  reconciler-data:
```

### Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `NETBIRD_API_URL` | yes | NetBird management API base URL |
| `NETBIRD_API_TOKEN` | yes | NetBird API token |
| `GITEA_URL` | yes | Gitea instance URL |
| `GITEA_TOKEN` | yes | Gitea API token for commits |
| `GITEA_REPO` | yes | `owner/repo` for netbird-gitops |
| `RECONCILER_TOKEN` | yes | Bearer token for CI auth |
| `POLL_INTERVAL_SECONDS` | no | Poll interval (default: 30) |
| `PORT` | no | Listen port (default: 8080) |

### Container Image Build

Tag-triggered CI (`v*`) in netbird-gitops:

1. `deno compile` to single binary
2. Docker build (`FROM denoland/deno:distroless`)
3. Push to Gitea container registry

## Error Handling & Rollback

**Validation phase (before mutations):**

- Parse and validate `netbird.json` schema
- Fetch all actual state
- Compute diff and verify all operations are possible
- If validation fails: return error, no mutations

**Apply phase:**

- Execute in dependency order (groups -> keys -> peers -> policies -> routes)
- On any failure: abort immediately, return partial results
- No automatic rollback; git revert is the rollback mechanism

**Why no automatic rollback:**

- Partial rollback is harder to get right than partial apply
- Git history provides a clear, auditable rollback path
- `git revert` + re-reconcile converges to correct state
- Reconciler is idempotent, so running twice with the same state is safe

**Recovery pattern:**

1. Reconcile fails mid-apply
2. CI job fails, engineer notified
3. Engineer either forward-fixes `netbird.json` or `git revert`s the merge commit
4. New push triggers reconcile, converging to correct state

**Logging:**

- Structured JSON logs
- Every NetBird API call logged (method, path, status)
- Every state mutation logged (before/after)
- Event poller logs each event processed

## Resources Managed

| Resource | NetBird API | Create | Update | Delete |
|----------|-------------|--------|--------|--------|
| Groups | `/api/groups` | yes | yes (peers) | yes |
| Setup Keys | `/api/setup-keys` | yes | no (immutable) | yes |
| Peers | `/api/peers` | no (self-enroll) | yes (rename, groups) | yes |
| Policies | `/api/policies` | yes | yes | yes |
| Routes | `/api/routes` | yes | yes | yes |
| DNS | `/api/dns/nameservers` | yes | yes | yes |
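
One way the table above could feed the validation phase ("verify all operations are possible") is as a capability map checked before any mutation. The sketch below is illustrative only: `CAPABILITIES`, `PlannedOp`, and `validatePlan` are hypothetical names, not part of the documented service.

```typescript
type Resource = "groups" | "setup_keys" | "peers" | "policies" | "routes" | "dns";
type Verb = "create" | "update" | "delete";

// Mirrors the Resources Managed table: which verbs the reconciler may issue per resource.
const CAPABILITIES: Record<Resource, Record<Verb, boolean>> = {
  groups: { create: true, update: true, delete: true },
  setup_keys: { create: true, update: false, delete: true }, // setup keys are immutable
  peers: { create: false, update: true, delete: true }, // peers self-enroll
  policies: { create: true, update: true, delete: true },
  routes: { create: true, update: true, delete: true },
  dns: { create: true, update: true, delete: true },
};

// Hypothetical shape of one planned operation.
interface PlannedOp {
  resource: Resource;
  verb: Verb;
  name: string;
}

// Returns one message per operation the table forbids.
// An empty array means every planned operation is permitted.
function validatePlan(ops: PlannedOp[]): string[] {
  return ops
    .filter((op) => !CAPABILITIES[op.resource][op.verb])
    .map((op) => `${op.verb} is not supported for ${op.resource} ("${op.name}")`);
}

// Example: an in-place change to a setup key is rejected before any API call is made.
const problems = validatePlan([
  { resource: "groups", verb: "create", name: "pilots" },
  { resource: "setup_keys", verb: "update", name: "GS-hawk-72" },
]);
console.log(problems);
```

Rejecting unsupported operations up front keeps the "no mutations on validation failure" guarantee: a plan that tries to update an immutable setup key fails as a whole instead of partway through apply.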