NetBird Reconciler — Design Document
Status: Approved
Author: @prox
Date: 2026-03-03
Proposal: NetBird GitOps Proposal (rev2)
Overview
A dedicated backend service that provides declarative GitOps-driven
reconciliation for NetBird VPN configuration. Engineers declare desired state in
netbird.json; the reconciler computes diffs, validates the full plan before any
mutation, and applies changes in dependency order, aborting on the first failure.
Repo: BlastPilot/netbird-gitops (service code + state file in one repo)
Runtime: TypeScript / Deno
Deployment: Docker Compose on the NetBird VPS, behind Traefik
Architecture
The reconciler has two responsibilities:
- Reconciliation API: called by Gitea Actions CI on PR events. Accepts desired state (netbird.json), fetches actual state from the NetBird API, computes a diff, and either returns a plan (dry-run) or applies changes.
- Event Poller: background loop polling NetBird /api/events every 30s to detect peer enrollments. When a peer enrolls via a known setup key, the poller renames it, assigns it to the correct group, and commits enrolled: true back to git via the Gitea API.
Data Flow
Engineer -> PR to netbird-gitops (edit netbird.json)
-> CI: dry-run -> reconciler -> plan posted as PR comment
-> PR merged -> CI: apply -> reconciler -> mutations to NetBird API
-> response with created_keys
-> CI: encrypt keys with age, upload artifact
Event poller (background):
-> polls NetBird /api/events
-> detects peer enrollment (peer.setupkey.add)
-> renames peer, assigns groups
-> commits enrolled:true via Gitea API
Integration with Enrollment Pipeline
The existing enrollment pipeline in blastpilot-public changes:
- Before: handleApproval() creates peers/enrollment-{N}.json, handlePRMerge() calls the NetBird API directly to create setup keys, emails PDF.
- After: handleApproval() modifies netbird.json (adds setup key + group entries) and creates a PR. Key creation is handled by the reconciler on merge. Key delivery starts as manual (engineer downloads encrypted artifact), with automation added later.
State File Format
netbird.json at repo root. All resources referenced by name, never by NetBird
ID.
{
  "groups": {
    "pilots": { "peers": ["Pilot-hawk-72"] },
    "ground-stations": { "peers": ["GS-hawk-72"] },
    "commanders": { "peers": [] }
  },
  "setup_keys": {
    "GS-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["ground-stations"],
      "enrolled": true
    },
    "Pilot-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["pilots"],
      "enrolled": false
    }
  },
  "policies": {
    "pilots-to-gs": {
      "description": "Allow pilots to reach ground stations",
      "enabled": true,
      "sources": ["pilots"],
      "destinations": ["ground-stations"],
      "bidirectional": true,
      "protocol": "ALL"
    }
  },
  "routes": {
    "gs-local-network": {
      "description": "Route to GS local subnet",
      "network": "192.168.1.0/24",
      "peer_groups": ["ground-stations"],
      "enabled": true
    }
  },
  "dns": {
    "nameserver_groups": {}
  }
}
Conventions:
- Setup key name = expected peer hostname
- enrolled: false means the setup key should exist and the peer hasn't connected yet
- enrolled: true means the peer was detected, renamed, and assigned to groups
- Groups reference peers by setup key name (becomes peer hostname after rename)
- Policies reference groups by name
- Reconciler maintains internal name-to-ID mapping fetched at plan time
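Because every cross-reference in the state file is by name, dangling references can be caught before any diff is computed. The following is a minimal sketch of such a check, not the reconciler's actual code: the type and function names (`DesiredState`, `validateReferences`) are hypothetical, only the fields needed for the check are modeled, and it assumes every peer listed in a group must correspond to a setup key name.

```typescript
// Hypothetical types mirroring the netbird.json example; only the fields
// needed for reference checking are modeled.
interface SetupKey {
  type: string;
  expires_in: number; // seconds
  usage_limit: number;
  auto_groups: string[];
  enrolled: boolean;
}

interface DesiredState {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, SetupKey>;
  policies: Record<string, { sources: string[]; destinations: string[] }>;
  routes: Record<string, { peer_groups: string[] }>;
}

// Returns human-readable problems; an empty array means every name resolves.
function validateReferences(state: DesiredState): string[] {
  const errors: string[] = [];
  const groupNames = new Set(Object.keys(state.groups));
  const keyNames = new Set(Object.keys(state.setup_keys));

  const requireGroup = (owner: string, group: string): void => {
    if (!groupNames.has(group)) errors.push(`${owner}: unknown group "${group}"`);
  };

  // Assumption: group members are always named after a setup key.
  for (const [name, group] of Object.entries(state.groups)) {
    for (const peer of group.peers) {
      if (!keyNames.has(peer)) errors.push(`group ${name}: unknown peer "${peer}"`);
    }
  }
  for (const [name, key] of Object.entries(state.setup_keys)) {
    key.auto_groups.forEach((g) => requireGroup(`setup key ${name}`, g));
  }
  for (const [name, policy] of Object.entries(state.policies)) {
    [...policy.sources, ...policy.destinations].forEach((g) => requireGroup(`policy ${name}`, g));
  }
  for (const [name, route] of Object.entries(state.routes)) {
    route.peer_groups.forEach((g) => requireGroup(`route ${name}`, g));
  }
  return errors;
}
```

A check like this would fit naturally into the validation phase, so that a typo in a group name fails the dry-run rather than the apply.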
API Endpoints
All endpoints authenticated via Authorization: Bearer <token>.
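A bearer-token check of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the service's implementation: `isAuthorized` and `safeEqual` are hypothetical names, the token is assumed to be the value of RECONCILER_TOKEN, and /health is exempted because the document describes it as unauthenticated.

```typescript
// Compares two strings without returning early at the first mismatch, so the
// running time depends on the longer length rather than on where they differ.
function safeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    // charCodeAt returns NaN past the end; NaN coerces to 0 in bitwise ops.
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// Hypothetical helper: decides whether a request may proceed.
function isAuthorized(
  pathname: string,
  authorization: string | null,
  expectedToken: string,
): boolean {
  if (pathname === "/health") return true; // health check needs no auth
  if (expectedToken.length === 0) return false; // never accept an empty secret
  const prefix = "Bearer ";
  if (authorization === null || !authorization.startsWith(prefix)) return false;
  return safeEqual(authorization.slice(prefix.length), expectedToken);
}
```

In a standard `Request` handler this could be called as `isAuthorized(new URL(req.url).pathname, req.headers.get("authorization"), token)`, returning a 401 response when it yields false.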
POST /reconcile
Query params: dry_run=true|false (default: false)
Request body: contents of netbird.json
Behavior:
- Fetch actual state from NetBird API (groups, setup keys, peers, policies, routes, DNS)
- Process pending enrollments from event poller state
- Compute diff between desired and actual
- If dry_run=true: return plan without applying
- If dry_run=false: execute in dependency order (groups, setup keys, peers, policies, routes). Abort on first failure.
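The diff step above can be illustrated with a small name-keyed comparison. This is a sketch under assumptions rather than the reconciler's algorithm: the identifiers (`Kind`, `diffKind`, `computePlan`) are invented for illustration, actual state is assumed to have already been translated from NetBird IDs to names, peers are left out because they self-enroll, and equality is approximated with `JSON.stringify`, which is sensitive to key order.

```typescript
// Illustrative names only; none of these identifiers come from the service.
type Kind = "group" | "setup_key" | "policy" | "route";
type Resources = Record<string, unknown>;

interface PlannedOp {
  type: string; // e.g. "create_group"
  name: string;
}

// Dependency order from the design (peers omitted: they cannot be created).
const ORDER: Kind[] = ["group", "setup_key", "policy", "route"];

function diffKind(kind: Kind, desired: Resources, actual: Resources): PlannedOp[] {
  const ops: PlannedOp[] = [];
  const desiredNames = Object.keys(desired);
  const actualNames = Object.keys(actual);
  const inDesired = new Set(desiredNames);
  const inActual = new Set(actualNames);

  for (const name of desiredNames) {
    if (!inActual.has(name)) {
      ops.push({ type: `create_${kind}`, name });
    } else if (
      kind !== "setup_key" && // setup keys are immutable, so never updated
      JSON.stringify(desired[name]) !== JSON.stringify(actual[name])
    ) {
      ops.push({ type: `update_${kind}`, name });
    }
  }
  for (const name of actualNames) {
    if (!inDesired.has(name)) ops.push({ type: `delete_${kind}`, name });
  }
  return ops;
}

function computePlan(
  desired: Record<Kind, Resources>,
  actual: Record<Kind, Resources>,
): PlannedOp[] {
  const plan: PlannedOp[] = [];
  for (const kind of ORDER) {
    plan.push(...diffKind(kind, desired[kind], actual[kind]));
  }
  return plan;
}
```

With dry_run=true the returned list would be the plan itself; with dry_run=false the same list would be handed to the apply step.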
Response:
{
  "status": "applied | planned | error",
  "operations": [
    { "type": "create_group", "name": "pilots", "status": "success" },
    {
      "type": "create_setup_key",
      "name": "Pilot-hawk-72",
      "status": "success"
    },
    {
      "type": "create_policy",
      "name": "pilots-to-gs",
      "status": "failed",
      "error": "..."
    }
  ],
  "created_keys": {
    "Pilot-hawk-72": "XXXXXX-XXXXXX-XXXXXX"
  },
  "summary": { "created": 3, "updated": 1, "deleted": 0, "failed": 0 }
}
created_keys only contains keys created in this run. CI uses this for
encrypted artifacts.
POST /sync-events
Forces the event poller to process pending events immediately. Returns detected enrollments.
{
  "enrollments": [
    {
      "setup_key_name": "GS-hawk-72",
      "peer_id": "abc123",
      "renamed": true,
      "groups_assigned": true
    }
  ]
}
GET /health
No auth. Returns service status for Docker healthcheck.
Event Poller
Mechanism:
- Polls GET /api/events every 30 seconds (configurable via POLL_INTERVAL_SECONDS)
- Persists last_event_timestamp to /data/poller-state.json (Docker volume)
- Loads last-known netbird.json desired state on startup and after each reconcile
Enrollment detection:
- Filter events for peer.setupkey.add activity
- Extract setup_key_name from event metadata
- Look up in desired state. If found and enrolled: false:
  - Rename peer to match setup key name via PUT /api/peers/{id}
  - Assign peer to groups from setup_keys[name].auto_groups
  - Commit enrolled: true to git via Gitea API (optimistic concurrency with SHA check)
  - Commit message: chore: mark {key_name} as enrolled [automated]
- If not found: log warning (unknown peer enrolled outside GitOps)
Edge cases:
- Race with reconcile: if reconcile is in progress, enrollment processing queues until complete
- Duplicate events: idempotent — skip if peer already renamed and enrolled
- Unknown peers: logged but not touched
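The detection rules and edge cases above can be expressed as a pure planning function that turns a batch of events into the actions to perform. This is a sketch only: the event shape (`PollerEvent`) is a simplified, hypothetical stand-in for the real /api/events payload, `planEnrollments` is an invented name, and timestamps are assumed to be ISO-8601 UTC strings in a uniform format so that string comparison matches chronological order.

```typescript
// Simplified, hypothetical event shape; the real NetBird payload may differ.
interface PollerEvent {
  activity: string;
  timestamp: string; // ISO-8601 UTC, uniform format assumed
  peerId: string;
  setupKeyName?: string;
}

interface KeyState {
  auto_groups: string[];
  enrolled: boolean;
}

interface EnrollmentAction {
  peerId: string;
  renameTo: string;
  groups: string[];
  commitMessage: string;
}

function planEnrollments(
  events: PollerEvent[],
  setupKeys: Record<string, KeyState>,
  lastSeen: string,
): { actions: EnrollmentAction[]; unknownPeers: string[]; lastSeen: string } {
  const keys = new Map(Object.entries(setupKeys));
  const handled = new Set<string>();
  const actions: EnrollmentAction[] = [];
  const unknownPeers: string[] = [];
  let newest = lastSeen;

  for (const ev of events) {
    if (ev.timestamp <= lastSeen) continue; // processed in an earlier poll
    if (ev.timestamp > newest) newest = ev.timestamp;
    if (ev.activity !== "peer.setupkey.add" || ev.setupKeyName === undefined) continue;

    const key = keys.get(ev.setupKeyName);
    if (key === undefined) {
      unknownPeers.push(ev.peerId); // logged by the caller, never modified
      continue;
    }
    // Idempotency: skip keys already enrolled or already handled in this batch.
    if (key.enrolled || handled.has(ev.setupKeyName)) continue;
    handled.add(ev.setupKeyName);
    actions.push({
      peerId: ev.peerId,
      renameTo: ev.setupKeyName,
      groups: key.auto_groups,
      commitMessage: `chore: mark ${ev.setupKeyName} as enrolled [automated]`,
    });
  }
  return { actions, unknownPeers, lastSeen: newest };
}
```

Keeping this step free of I/O means the rename, group assignment, and git commit can be executed (or queued behind an in-progress reconcile) by a separate caller, and the returned `lastSeen` is what would be persisted to the poller state file.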
CI Workflows
dry-run.yml — On PR open/update
on:
  pull_request:
    paths: ["netbird.json"]
Steps:
- Checkout PR branch
- POST /reconcile?dry_run=true with netbird.json
- Format response as markdown table
- Post/update PR comment via Gitea API
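The formatting step could look roughly like the following. It is a sketch, assuming the `operations` array has the shape shown in the /reconcile response; the function name `formatPlan`, the column headings, and the "planned" fallback status are choices made here for illustration.

```typescript
// Subset of the operation object from the /reconcile response.
interface PlanRow {
  type: string;
  name: string;
  status?: string;
}

// Renders the plan as a markdown table suitable for a PR comment body.
function formatPlan(operations: PlanRow[]): string {
  if (operations.length === 0) return "No changes.";
  // Escape pipes and flatten newlines so a value cannot break the table.
  const cell = (s: string): string => s.replace(/\|/g, "\\|").replace(/\r?\n/g, " ");
  const rows = operations.map(
    (op) => `| ${cell(op.type)} | ${cell(op.name)} | ${cell(op.status ?? "planned")} |`,
  );
  return ["| Operation | Name | Status |", "|---|---|---|", ...rows].join("\n");
}
```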
reconcile.yml — On push to main
on:
  push:
    branches: [main]
    paths: ["netbird.json"]
Steps:
- Checkout repo
- POST /sync-events: process pending enrollments
- POST /reconcile with netbird.json
- If created_keys non-empty: encrypt with age, upload as Gitea Actions artifact
- Pull latest (poller may have committed)
- On failure: job fails, engineer investigates
Gitea Secrets
| Secret | Purpose |
|---|---|
| RECONCILER_URL | Reconciler service URL |
| RECONCILER_TOKEN | Bearer token for CI auth |
| AGE_PUBLIC_KEY | Encrypts setup key artifacts |
| GITEA_TOKEN | PR comment posting (achilles-ci-bot) |
Deployment
Docker Compose on the NetBird VPS:
services:
  netbird-reconciler:
    image: gitea.internal/blastpilot/netbird-reconciler:latest
    restart: unless-stopped
    environment:
      NETBIRD_API_URL: "https://netbird.example.com/api"
      NETBIRD_API_TOKEN: "${NETBIRD_API_TOKEN}"
      GITEA_URL: "https://gitea.example.com"
      GITEA_TOKEN: "${GITEA_TOKEN}"
      GITEA_REPO: "BlastPilot/netbird-gitops"
      RECONCILER_TOKEN: "${RECONCILER_TOKEN}"
      POLL_INTERVAL_SECONDS: "30"
      PORT: "8080"
    volumes:
      - reconciler-data:/data
    healthcheck:
      # Assumes curl exists in the image; distroless bases typically do not ship it.
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.reconciler.rule=Host(`reconciler.internal`)"

# Named volumes must be declared at the top level (omit if the enclosing
# compose file already declares it).
volumes:
  reconciler-data:
Environment Variables
| Variable | Required | Description |
|---|---|---|
| NETBIRD_API_URL | yes | NetBird management API base URL |
| NETBIRD_API_TOKEN | yes | NetBird API token |
| GITEA_URL | yes | Gitea instance URL |
| GITEA_TOKEN | yes | Gitea API token for commits |
| GITEA_REPO | yes | owner/repo for netbird-gitops |
| RECONCILER_TOKEN | yes | Bearer token for CI auth |
| POLL_INTERVAL_SECONDS | no | Poll interval (default: 30) |
| PORT | no | Listen port (default: 8080) |
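Loading these variables with fail-fast validation might look like the sketch below. The `Config` shape and the `loadConfig` name are assumptions made for illustration; the variable names and defaults come from the table. Taking the environment as a plain object keeps the function testable without touching the process environment.

```typescript
// Hypothetical config shape; field names are illustrative.
interface Config {
  netbirdApiUrl: string;
  netbirdApiToken: string;
  giteaUrl: string;
  giteaToken: string;
  giteaRepo: string;
  reconcilerToken: string;
  pollIntervalSeconds: number;
  port: number;
}

type Env = Record<string, string | undefined>;

function loadConfig(env: Env): Config {
  const required = (name: string): string => {
    const value = env[name];
    if (value === undefined || value === "") {
      throw new Error(`missing required environment variable ${name}`);
    }
    return value;
  };
  const positiveInt = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw === "") return fallback;
    const n = Number(raw);
    if (!Number.isInteger(n) || n <= 0) {
      throw new Error(`${name} must be a positive integer, got "${raw}"`);
    }
    return n;
  };
  return {
    netbirdApiUrl: required("NETBIRD_API_URL"),
    netbirdApiToken: required("NETBIRD_API_TOKEN"),
    giteaUrl: required("GITEA_URL"),
    giteaToken: required("GITEA_TOKEN"),
    giteaRepo: required("GITEA_REPO"),
    reconcilerToken: required("RECONCILER_TOKEN"),
    pollIntervalSeconds: positiveInt("POLL_INTERVAL_SECONDS", 30),
    port: positiveInt("PORT", 8080),
  };
}
```

Under Deno this could be invoked once at startup as `loadConfig(Deno.env.toObject())`, so a misconfigured container exits immediately instead of failing on the first request.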
Container Image Build
Tag-triggered CI (v*) in netbird-gitops:
- deno compile to single binary
- Docker build (FROM denoland/deno:distroless)
- Push to Gitea container registry
Error Handling & Rollback
Validation phase (before mutations):
- Parse and validate netbird.json schema
- Fetch all actual state
- Compute diff and verify all operations are possible
- If validation fails: return error, no mutations
Apply phase:
- Execute in dependency order (groups -> keys -> peers -> policies -> routes)
- On any failure: abort immediately, return partial results
- No automatic rollback — git revert is the rollback mechanism
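The abort-on-first-failure behavior of the apply phase can be sketched as a simple loop. The names (`ApplyOp`, `applyPlan`) are hypothetical, and the executor is a synchronous callback purely to keep the control flow visible; in the real service each operation would presumably be an awaited NetBird API request.

```typescript
interface ApplyOp {
  type: string;
  name: string;
}

interface ApplyResult extends ApplyOp {
  status: "success" | "failed";
  error?: string;
}

// Runs operations in the given (dependency) order and stops at the first
// failure. Operations after the failing one are never attempted, and nothing
// already applied is rolled back.
function applyPlan(
  ops: ApplyOp[],
  execute: (op: ApplyOp) => void,
): { status: "applied" | "error"; operations: ApplyResult[] } {
  const results: ApplyResult[] = [];
  for (const op of ops) {
    try {
      execute(op);
      results.push({ ...op, status: "success" });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ ...op, status: "failed", error: message });
      return { status: "error", operations: results }; // partial results
    }
  }
  return { status: "applied", operations: results };
}
```

The partial `operations` list is what lets an engineer see exactly how far the apply got before deciding between a forward fix and a git revert.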
Why no automatic rollback:
- Partial rollback is harder to get right than partial apply
- Git history provides clear, auditable rollback path
- git revert + re-reconcile converges to correct state
- Reconciler is idempotent: running twice with same state is safe
Recovery pattern:
- Reconcile fails mid-apply
- CI job fails, engineer notified
- Engineer either forward-fixes netbird.json or git reverts the merge commit
- New push triggers reconcile, converging to correct state
Logging:
- Structured JSON logs
- Every NetBird API call logged (method, path, status)
- Every state mutation logged (before/after)
- Event poller logs each event processed
Resources Managed
| Resource | NetBird API | Create | Update | Delete |
|---|---|---|---|---|
| Groups | /api/groups | yes | yes (peers) | yes |
| Setup Keys | /api/setup-keys | yes | no (immutable) | yes |
| Peers | /api/peers | no (self-enroll) | yes (rename, groups) | yes |
| Policies | /api/policies | yes | yes | yes |
| Routes | /api/routes | yes | yes | yes |
| DNS | /api/dns/nameservers | yes | yes | yes |