NetBird Reconciler — Design Document

Status: Approved
Author: @prox
Date: 2026-03-03
Proposal: NetBird GitOps Proposal (rev2)

Overview

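A dedicated backend service that provides declarative, GitOps-driven reconciliation for NetBird VPN configuration. Engineers declare the desired state in netbird.json; the reconciler computes a diff against the live NetBird state, validates the whole plan before mutating anything, and then applies changes in dependency order, aborting on the first failure. There is no automatic rollback (see Error Handling & Rollback).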

Repo: BlastPilot/netbird-gitops (service code + state file in one repo)
Runtime: TypeScript / Deno
Deployment: Docker Compose on the NetBird VPS, behind Traefik

Architecture

The reconciler has two responsibilities:

  1. Reconciliation API: called by Gitea Actions CI on PR events (dry-run) and on pushes to main (apply). Accepts the desired state (netbird.json), fetches the actual state from the NetBird API, computes a diff, and either returns a plan (dry-run) or applies the changes.

  2. Event Poller: background loop that polls NetBird /api/events every 30s to detect peer enrollments. When a peer enrolls via a known setup key, the poller renames it, assigns it to the correct groups, and commits enrolled: true back to git via the Gitea API.

Data Flow

Engineer -> PR to netbird-gitops (edit netbird.json)
         -> CI: dry-run -> reconciler -> plan posted as PR comment
         -> PR merged -> CI: apply -> reconciler -> mutations to NetBird API
                                                 -> response with created_keys
         -> CI: encrypt keys with age, upload artifact

Event poller (background):
         -> polls NetBird /api/events
         -> detects peer enrollment (peer.setupkey.add)
         -> renames peer, assigns groups
         -> commits enrolled:true via Gitea API

Integration with Enrollment Pipeline

The existing enrollment pipeline in blastpilot-public changes:

  • Before: handleApproval() creates peers/enrollment-{N}.json, handlePRMerge() calls NetBird API directly to create setup keys, emails PDF.
  • After: handleApproval() modifies netbird.json (adds setup key + group entries) and opens a PR. The reconciler creates the key on merge. Key delivery is manual at first (an engineer downloads the encrypted artifact); automation is planned for later.

State File Format

netbird.json at repo root. All resources referenced by name, never by NetBird ID.

{
  "groups": {
    "pilots": { "peers": ["Pilot-hawk-72"] },
    "ground-stations": { "peers": ["GS-hawk-72"] },
    "commanders": { "peers": [] }
  },
  "setup_keys": {
    "GS-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["ground-stations"],
      "enrolled": true
    },
    "Pilot-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["pilots"],
      "enrolled": false
    }
  },
  "policies": {
    "pilots-to-gs": {
      "description": "Allow pilots to reach ground stations",
      "enabled": true,
      "sources": ["pilots"],
      "destinations": ["ground-stations"],
      "bidirectional": true,
      "protocol": "ALL"
    }
  },
  "routes": {
    "gs-local-network": {
      "description": "Route to GS local subnet",
      "network": "192.168.1.0/24",
      "peer_groups": ["ground-stations"],
      "enabled": true
    }
  },
  "dns": {
    "nameserver_groups": {}
  }
}

Conventions:

  • Setup key name = expected peer hostname
  • enrolled: false — setup key should exist, peer hasn't connected yet
  • enrolled: true — peer detected, renamed, assigned to groups
  • Groups reference peers by setup key name (becomes peer hostname after rename)
  • Policies reference groups by name
  • Reconciler maintains internal name-to-ID mapping fetched at plan time
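
The name-based references above can be checked before any API call is made. The following is a minimal sketch, assuming the field names from the example file; the type names, the findDanglingRefs helper, and the rule that every peer listed in a group must have a matching setup key are illustrative assumptions, not part of the approved design.

```typescript
// Types mirror the example netbird.json above. Names are hypothetical.
interface SetupKey {
  type: string;
  expires_in: number;
  usage_limit: number;
  auto_groups: string[];
  enrolled: boolean;
}

interface Policy {
  description: string;
  enabled: boolean;
  sources: string[];
  destinations: string[];
  bidirectional: boolean;
  protocol: string;
}

interface Route {
  description: string;
  network: string;
  peer_groups: string[];
  enabled: boolean;
}

interface DesiredState {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, SetupKey>;
  policies: Record<string, Policy>;
  routes: Record<string, Route>;
  dns: { nameserver_groups: Record<string, unknown> };
}

// Returns one message per name that does not resolve.
// An empty array means every cross-reference is satisfied.
function findDanglingRefs(state: DesiredState): string[] {
  const errors: string[] = [];
  const groupNames = new Set(Object.keys(state.groups));
  const keyNames = new Set(Object.keys(state.setup_keys));

  // Assumption: group members are always named after a setup key.
  for (const [group, { peers }] of Object.entries(state.groups)) {
    for (const peer of peers) {
      if (!keyNames.has(peer)) errors.push(`group "${group}": unknown peer "${peer}"`);
    }
  }
  for (const [key, { auto_groups }] of Object.entries(state.setup_keys)) {
    for (const g of auto_groups) {
      if (!groupNames.has(g)) errors.push(`setup key "${key}": unknown group "${g}"`);
    }
  }
  for (const [name, policy] of Object.entries(state.policies)) {
    for (const g of [...policy.sources, ...policy.destinations]) {
      if (!groupNames.has(g)) errors.push(`policy "${name}": unknown group "${g}"`);
    }
  }
  for (const [name, route] of Object.entries(state.routes)) {
    for (const g of route.peer_groups) {
      if (!groupNames.has(g)) errors.push(`route "${name}": unknown group "${g}"`);
    }
  }
  return errors;
}
```

A check like this would fit the validation phase, so that a typo in a group name fails the dry-run instead of the apply.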

API Endpoints

All endpoints except GET /health require Authorization: Bearer <token>.

POST /reconcile

Query params: dry_run=true|false (default: false)
Request body: Contents of netbird.json

Behavior:

  1. Fetch actual state from NetBird API (groups, setup keys, peers, policies, routes, DNS)
  2. Process pending enrollments from event poller state
  3. Compute diff between desired and actual
  4. If dry_run=true: return plan without applying
  5. If dry_run=false: execute in dependency order — groups, setup keys, peers, policies, routes. Abort on first failure.
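
Steps 3 and 5 above can be sketched as a name-keyed diff followed by a sort into dependency order. This is an illustration under assumptions: the Operation shape, diffByName, and orderOps are hypothetical names, and comparing resources with JSON.stringify is a simplification that only works if both sides serialize keys in the same order.

```typescript
type OpType = "create" | "update" | "delete";

interface Operation {
  type: OpType;
  resource: string;
  name: string;
}

const has = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Diff one resource kind. Both maps are keyed by resource name.
function diffByName<T>(
  resource: string,
  desired: Record<string, T>,
  actual: Record<string, T>,
): Operation[] {
  const ops: Operation[] = [];
  for (const name of Object.keys(desired)) {
    if (!has(actual, name)) {
      ops.push({ type: "create", resource, name });
    } else if (JSON.stringify(desired[name]) !== JSON.stringify(actual[name])) {
      ops.push({ type: "update", resource, name });
    }
  }
  for (const name of Object.keys(actual)) {
    if (!has(desired, name)) ops.push({ type: "delete", resource, name });
  }
  return ops;
}

// Dependency order as listed in step 5. Resource kinds not in the list
// (for example DNS, whose position the document does not state) sort last.
const ORDER = ["groups", "setup_keys", "peers", "policies", "routes"];

function orderOps(ops: Operation[]): Operation[] {
  const rank = (resource: string): number => {
    const i = ORDER.indexOf(resource);
    return i === -1 ? ORDER.length : i;
  };
  return [...ops].sort((a, b) => rank(a.resource) - rank(b.resource));
}
```

The sketch applies one order to all operation types. Whether deletes should run in reverse order (policies before the groups they reference) is not specified here and is left open.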

Response:

{
  "status": "applied | planned | error",
  "operations": [
    { "type": "create_group", "name": "pilots", "status": "success" },
    { "type": "create_setup_key", "name": "Pilot-hawk-72", "status": "success" },
    { "type": "create_policy", "name": "pilots-to-gs", "status": "failed", "error": "..." }
  ],
  "created_keys": {
    "Pilot-hawk-72": "XXXXXX-XXXXXX-XXXXXX"
  },
  "summary": { "created": 2, "updated": 0, "deleted": 0, "failed": 1 }
}

created_keys contains only the keys created in this run. CI encrypts them with age and uploads them as an artifact.
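
The summary block can be derived from the operations list. A minimal sketch, assuming operation types follow a verb_resource naming pattern; only create_* types appear in this document, so the update_ and delete_ prefixes are assumptions, as are the type and function names.

```typescript
interface OperationResult {
  type: string;
  name: string;
  status: "success" | "failed";
  error?: string;
}

interface Summary {
  created: number;
  updated: number;
  deleted: number;
  failed: number;
}

// Count successful operations by verb prefix; failed ones are counted
// once under "failed" regardless of verb.
function summarize(operations: OperationResult[]): Summary {
  const summary: Summary = { created: 0, updated: 0, deleted: 0, failed: 0 };
  for (const op of operations) {
    if (op.status === "failed") summary.failed++;
    else if (op.type.startsWith("create_")) summary.created++;
    else if (op.type.startsWith("update_")) summary.updated++;
    else if (op.type.startsWith("delete_")) summary.deleted++;
  }
  return summary;
}
```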

POST /sync-events

Forces the event poller to process pending events immediately. Returns detected enrollments.

{
  "enrollments": [
    { "setup_key_name": "GS-hawk-72", "peer_id": "abc123", "renamed": true, "groups_assigned": true }
  ]
}

GET /health

No auth. Returns service status for Docker healthcheck.
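
The authentication rule for the three endpoints can be sketched as a pure check that the HTTP handler calls before routing. The function name and signature are hypothetical; only the paths and the Bearer scheme come from this document.

```typescript
// /health is the only path served without a token.
const PUBLIC_PATHS = new Set<string>(["/health"]);

// `authorization` is the raw Authorization header value, or null if absent.
function isAuthorized(
  pathname: string,
  authorization: string | null,
  expectedToken: string,
): boolean {
  if (PUBLIC_PATHS.has(pathname)) return true;
  if (authorization === null || !authorization.startsWith("Bearer ")) return false;
  const presented = authorization.slice("Bearer ".length);
  return presented.length > 0 && presented === expectedToken;
}
```

The comparison uses plain string equality for brevity; a production handler would likely want a constant-time comparison for the token.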

Event Poller

Mechanism:

  • Polls GET /api/events every 30 seconds (configurable via POLL_INTERVAL_SECONDS)
  • Persists last_event_timestamp to /data/poller-state.json (Docker volume)
  • Loads last-known netbird.json desired state on startup and after each reconcile
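
The cursor handling described above can be sketched as a pure function that picks events newer than last_event_timestamp and returns the state to persist. It assumes each event carries an ISO 8601 timestamp field; the type and function names are illustrative.

```typescript
interface TimedEvent {
  timestamp: string; // ISO 8601, e.g. "2026-03-03T10:00:00Z"
}

interface PollerState {
  last_event_timestamp: string | null;
}

// Returns events strictly newer than the cursor, oldest first, plus the
// state to write to /data/poller-state.json after they are processed.
function selectNewEvents<E extends TimedEvent>(
  events: E[],
  state: PollerState,
): { fresh: E[]; next: PollerState } {
  const since = state.last_event_timestamp === null
    ? -Infinity
    : Date.parse(state.last_event_timestamp);
  const fresh = events
    .filter((e) => Date.parse(e.timestamp) > since)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
  const last = fresh[fresh.length - 1];
  return {
    fresh,
    next: { last_event_timestamp: last ? last.timestamp : state.last_event_timestamp },
  };
}
```

Because the comparison is strict, two events sharing the cursor's exact timestamp could be missed if only one was seen in the previous poll. Tracking event IDs alongside the timestamp would close that gap.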

Enrollment detection:

  1. Filter events for peer.setupkey.add activity
  2. Extract setup_key_name from event metadata
  3. Look up in desired state — if found and enrolled: false:
    • Rename peer to match setup key name via PUT /api/peers/{id}
    • Assign peer to groups from setup_keys[name].auto_groups
    • Commit enrolled: true to git via Gitea API (optimistic concurrency with SHA check)
    • Commit message: chore: mark {key_name} as enrolled [automated]
  4. If not found: log warning (unknown peer enrolled outside GitOps)
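
The detection steps above reduce to a per-event decision. The sketch below keeps that decision free of side effects so the caller can perform the rename, group assignment, and commit separately. The activity name is taken from this document; the event field names (activity_code, target_id as the peer ID, meta.setup_key_name) and all type names are assumptions.

```typescript
type Action =
  | { kind: "enroll"; peerId: string; name: string; groups: string[] }
  | { kind: "skip"; reason: string }
  | { kind: "warn"; reason: string };

interface KeyEntry {
  auto_groups: string[];
  enrolled: boolean;
}

interface EnrollEvent {
  activity_code: string;
  target_id: string; // assumed to hold the new peer's ID
  meta: { setup_key_name?: string };
}

const ENROLL_ACTIVITY = "peer.setupkey.add"; // as named in this document

function decide(event: EnrollEvent, setupKeys: Record<string, KeyEntry>): Action {
  if (event.activity_code !== ENROLL_ACTIVITY) {
    return { kind: "skip", reason: "not an enrollment event" };
  }
  const name = event.meta.setup_key_name;
  const entry =
    name !== undefined && Object.prototype.hasOwnProperty.call(setupKeys, name)
      ? setupKeys[name]
      : undefined;
  if (name === undefined || entry === undefined) {
    // Step 4: peer enrolled with a key that is not in desired state.
    return { kind: "warn", reason: `peer ${event.target_id} enrolled outside GitOps` };
  }
  if (entry.enrolled) {
    // Duplicate event: already handled, nothing to do.
    return { kind: "skip", reason: `${name} already enrolled` };
  }
  return { kind: "enroll", peerId: event.target_id, name, groups: entry.auto_groups };
}
```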

Edge cases:

  • Race with reconcile: if a reconcile is in progress, enrollment processing is queued until it completes
  • Duplicate events: idempotent — skip if peer already renamed and enrolled
  • Unknown peers: logged but not touched
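
The commit step with its SHA check can be sketched as a function that builds the request for Gitea's contents endpoint (PUT /api/v1/repos/{owner}/{repo}/contents/{filepath}, which takes base64 content plus the SHA of the file being replaced). The function name, the FileUpdate shape, and the two-space JSON formatting are assumptions; sending the request and retrying are left to the caller.

```typescript
interface FileUpdate {
  url: string;
  method: "PUT";
  body: { content: string; sha: string; message: string };
}

// UTF-8 safe base64, since btoa alone only accepts Latin-1 input.
function toBase64(text: string): string {
  let binary = "";
  new TextEncoder().encode(text).forEach((byte) => {
    binary += String.fromCharCode(byte);
  });
  return btoa(binary);
}

// fileText and fileSha come from a prior GET of netbird.json.
function buildEnrolledCommit(
  giteaUrl: string,
  repo: string, // "owner/repo", as in GITEA_REPO
  fileText: string,
  fileSha: string,
  keyName: string,
): FileUpdate {
  const state = JSON.parse(fileText) as {
    setup_keys: Record<string, { enrolled: boolean }>;
  };
  const known = Object.prototype.hasOwnProperty.call(state.setup_keys, keyName);
  const entry = known ? state.setup_keys[keyName] : undefined;
  if (!entry) throw new Error(`unknown setup key: ${keyName}`);
  entry.enrolled = true;
  const updated = JSON.stringify(state, null, 2) + "\n";
  return {
    url: `${giteaUrl}/api/v1/repos/${repo}/contents/netbird.json`,
    method: "PUT",
    body: {
      content: toBase64(updated),
      sha: fileSha, // Gitea rejects the update if the file changed since this SHA
      message: `chore: mark ${keyName} as enrolled [automated]`,
    },
  };
}
```

If Gitea rejects the update because the SHA is stale, the caller would re-fetch the file, rebuild the request from the fresh contents, and try again.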

CI Workflows

dry-run.yml — On PR open/update

on:
  pull_request:
    paths: ['netbird.json']

Steps:

  1. Checkout PR branch
  2. POST /reconcile?dry_run=true with netbird.json
  3. Format response as markdown table
  4. Post/update PR comment via Gitea API
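
Step 3 can be sketched as a small formatter over the operations array from the /reconcile response. The column set and the formatPlan name are assumptions; the input fields follow the response example in the API section.

```typescript
interface PlannedOp {
  type: string;
  name: string;
  status: string;
  error?: string;
}

// Render operations as a markdown table suitable for a PR comment.
function formatPlan(operations: PlannedOp[]): string {
  if (operations.length === 0) return "No changes.";
  // Escape pipes and flatten newlines so one cell cannot break the table.
  const esc = (s: string): string => s.replace(/\|/g, "\\|").replace(/\r?\n/g, " ");
  const rows = operations.map(
    (op) => `| ${esc(op.type)} | ${esc(op.name)} | ${esc(op.status)} | ${esc(op.error ?? "")} |`,
  );
  return [
    "| Operation | Name | Status | Error |",
    "| --- | --- | --- | --- |",
    ...rows,
  ].join("\n");
}
```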

reconcile.yml — On push to main

on:
  push:
    branches: [main]
    paths: ['netbird.json']

Steps:

  1. Checkout repo
  2. POST /sync-events — process pending enrollments
  3. POST /reconcile with netbird.json
  4. If created_keys non-empty: encrypt with age, upload as Gitea Actions artifact
  5. Pull latest (poller may have committed)
  6. On failure: job fails, engineer investigates

Gitea Secrets

Secret            Purpose
RECONCILER_URL    Reconciler service URL
RECONCILER_TOKEN  Bearer token for CI auth
AGE_PUBLIC_KEY    Encrypts setup key artifacts
GITEA_TOKEN       PR comment posting (achilles-ci-bot)

Deployment

Docker Compose on the NetBird VPS:

services:
  netbird-reconciler:
    image: gitea.internal/blastpilot/netbird-reconciler:latest
    restart: unless-stopped
    environment:
      NETBIRD_API_URL: "https://netbird.example.com/api"
      NETBIRD_API_TOKEN: "${NETBIRD_API_TOKEN}"
      GITEA_URL: "https://gitea.example.com"
      GITEA_TOKEN: "${GITEA_TOKEN}"
      GITEA_REPO: "BlastPilot/netbird-gitops"
      RECONCILER_TOKEN: "${RECONCILER_TOKEN}"
      POLL_INTERVAL_SECONDS: "30"
      PORT: "8080"
    volumes:
      - reconciler-data:/data
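    healthcheck:
      # Assumes curl exists in the image. denoland/deno:distroless ships no
      # curl or shell, so the image must add it or use a different probe.
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.reconciler.rule=Host(`reconciler.internal`)"

# Named volumes must be declared at the top level.
volumes:
  reconciler-data: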

Environment Variables

Variable               Required  Description
NETBIRD_API_URL        yes       NetBird management API base URL
NETBIRD_API_TOKEN      yes       NetBird API token
GITEA_URL              yes       Gitea instance URL
GITEA_TOKEN            yes       Gitea API token for commits
GITEA_REPO             yes       owner/repo for netbird-gitops
RECONCILER_TOKEN       yes       Bearer token for CI auth
POLL_INTERVAL_SECONDS  no        Poll interval (default: 30)
PORT                   no        Listen port (default: 8080)
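
Loading these variables at startup might look like the sketch below, which fails fast on a missing required value and applies the documented defaults. The Config shape and loadConfig name are hypothetical; the variable names and defaults are the ones listed above.

```typescript
interface Config {
  netbirdApiUrl: string;
  netbirdApiToken: string;
  giteaUrl: string;
  giteaToken: string;
  giteaRepo: string;
  reconcilerToken: string;
  pollIntervalSeconds: number;
  port: number;
}

// Takes the environment as a plain map so it can be tested without
// touching the real process environment.
function loadConfig(env: Record<string, string | undefined>): Config {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`missing required environment variable ${name}`);
    return value;
  };
  const optionalInt = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw === "") return fallback;
    const n = Number(raw);
    if (!Number.isInteger(n) || n <= 0) {
      throw new Error(`${name} must be a positive integer, got "${raw}"`);
    }
    return n;
  };
  return {
    netbirdApiUrl: required("NETBIRD_API_URL"),
    netbirdApiToken: required("NETBIRD_API_TOKEN"),
    giteaUrl: required("GITEA_URL"),
    giteaToken: required("GITEA_TOKEN"),
    giteaRepo: required("GITEA_REPO"),
    reconcilerToken: required("RECONCILER_TOKEN"),
    pollIntervalSeconds: optionalInt("POLL_INTERVAL_SECONDS", 30),
    port: optionalInt("PORT", 8080),
  };
}
```

Under Deno the map can come from Deno.env.toObject(), which requires the env permission.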

Container Image Build

Tag-triggered CI (v*) in netbird-gitops:

  1. deno compile to single binary
  2. Docker build (FROM denoland/deno:distroless)
  3. Push to Gitea container registry

Error Handling & Rollback

Validation phase (before mutations):

  • Parse and validate netbird.json schema
  • Fetch all actual state
  • Compute diff and verify all operations are possible
  • If validation fails: return error, no mutations

Apply phase:

  • Execute in dependency order (groups -> keys -> peers -> policies -> routes)
  • On any failure: abort immediately, return partial results
  • No automatic rollback — git revert is the rollback mechanism
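
The apply-phase behaviour above (ordered execution, abort on first failure, partial results returned) can be sketched as follows. The names are hypothetical, and the execute callback stands in for the NetBird API call; it is synchronous here only to keep the sketch short, whereas a real implementation would await each request.

```typescript
interface Op {
  type: string;
  name: string;
}

interface OpResult extends Op {
  status: "success" | "failed";
  error?: string;
}

interface ApplyResult {
  status: "applied" | "error";
  operations: OpResult[];
}

// `ops` must already be in dependency order.
function applyPlan(ops: Op[], execute: (op: Op) => void): ApplyResult {
  const operations: OpResult[] = [];
  for (const op of ops) {
    try {
      execute(op);
      operations.push({ ...op, status: "success" });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      operations.push({ ...op, status: "failed", error });
      // Abort: remaining operations never run and nothing is rolled back.
      return { status: "error", operations };
    }
  }
  return { status: "applied", operations };
}
```

Operations after the failed one are absent from the result, which tells the engineer exactly how far the apply got before a forward-fix or revert.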

Why no automatic rollback:

  • Partial rollback is harder to get right than partial apply
  • Git history provides clear, auditable rollback path
  • git revert + re-reconcile converges to correct state
  • Reconciler is idempotent — running twice with same state is safe

Recovery pattern:

  1. Reconcile fails mid-apply
  2. CI job fails, engineer notified
  3. Engineer either forward-fixes netbird.json or git reverts the merge commit
  4. New push triggers reconcile, converging to correct state

Logging:

  • Structured JSON logs
  • Every NetBird API call logged (method, path, status)
  • Every state mutation logged (before/after)
  • Event poller logs each event processed

Resources Managed

Resource    NetBird API           Create            Update                Delete
Groups      /api/groups           yes               yes (peers)           yes
Setup Keys  /api/setup-keys       yes               no (immutable)        yes
Peers       /api/peers            no (self-enroll)  yes (rename, groups)  yes
Policies    /api/policies         yes               yes                   yes
Routes      /api/routes           yes               yes                   yes
DNS         /api/dns/nameservers  yes               yes                   yes