# NetBird Reconciler: Design Document

> **Status:** Approved
> **Author:** @prox
> **Date:** 2026-03-03
> **Proposal:** NetBird GitOps Proposal (rev2)

## Overview

The NetBird Reconciler is a dedicated backend service that provides declarative, GitOps-driven reconciliation for NetBird VPN configuration. Engineers declare the desired state in `netbird.json`; the reconciler diffs it against the actual NetBird state and applies the changes. Validation is all-or-nothing (a failed validation causes no mutations), while the apply phase aborts on the first failure and relies on `git revert` rather than automatic rollback.

**Repo:** `BlastPilot/netbird-gitops` (service code + state file in one repo)
**Runtime:** TypeScript / Deno
**Deployment:** Docker Compose on the NetBird VPS, behind Traefik

## Architecture

The reconciler has two responsibilities:

1. **Reconciliation API**: Called by Gitea Actions CI on PR events. Accepts desired state (`netbird.json`), fetches actual state from the NetBird API, computes a diff, and either returns a plan (dry-run) or applies changes.

2. **Event Poller**: Background loop polling NetBird `/api/events` every 30s to detect peer enrollments. When a peer enrolls via a known setup key, the poller renames it, assigns it to the correct group, and commits `enrolled: true` back to git via the Gitea API.

### Data Flow

```
Engineer -> PR to netbird-gitops (edit netbird.json)
  -> CI: dry-run -> reconciler -> plan posted as PR comment
  -> PR merged -> CI: apply -> reconciler -> mutations to NetBird API
                                          -> response with created_keys
  -> CI: encrypt keys with age, upload artifact

Event poller (background):
  -> polls NetBird /api/events
  -> detects peer enrollment (peer.setupkey.add)
  -> renames peer, assigns groups
  -> commits enrolled:true via Gitea API
```

### Integration with Enrollment Pipeline

The existing enrollment pipeline in `blastpilot-public` changes as follows:

- **Before:** `handleApproval()` creates `peers/enrollment-{N}.json`; `handlePRMerge()` calls the NetBird API directly to create setup keys and emails a PDF.
- **After:** `handleApproval()` modifies `netbird.json` (adds setup key + group entries) and creates a PR. Key creation is handled by the reconciler on merge. Key delivery starts as manual (the engineer downloads an encrypted artifact), with automation added later.

## State File Format

`netbird.json` at repo root. All resources referenced by name, never by NetBird ID.

```json
{
  "groups": {
    "pilots": { "peers": ["Pilot-hawk-72"] },
    "ground-stations": { "peers": ["GS-hawk-72"] },
    "commanders": { "peers": [] }
  },
  "setup_keys": {
    "GS-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["ground-stations"],
      "enrolled": true
    },
    "Pilot-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["pilots"],
      "enrolled": false
    }
  },
  "policies": {
    "pilots-to-gs": {
      "description": "Allow pilots to reach ground stations",
      "enabled": true,
      "sources": ["pilots"],
      "destinations": ["ground-stations"],
      "bidirectional": true,
      "protocol": "ALL"
    }
  },
  "routes": {
    "gs-local-network": {
      "description": "Route to GS local subnet",
      "network": "192.168.1.0/24",
      "peer_groups": ["ground-stations"],
      "enabled": true
    }
  },
  "dns": {
    "nameserver_groups": {}
  }
}
```

**Conventions:**
- Setup key name = expected peer hostname
- `enrolled: false`: setup key should exist, peer hasn't connected yet
- `enrolled: true`: peer detected, renamed, assigned to groups
- Groups reference peers by setup key name (becomes peer hostname after rename)
- Policies reference groups by name
- Reconciler maintains internal name-to-ID mapping fetched at plan time

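The conventions above imply that every name used as a reference should resolve inside the same file. The following is a minimal sketch of such a check, not the reconciler's actual schema: the type names and `findDanglingReferences` are hypothetical, the `"reusable"` key type is an assumption beyond the `"one-off"` value shown in the example, and treating every group peer as a setup key name is an assumption drawn from the convention list.

```typescript
// Hypothetical types mirroring part of the netbird.json example (routes and DNS omitted).
interface SetupKeySpec {
  type: "one-off" | "reusable"; // "reusable" is assumed; the example only shows "one-off"
  expires_in: number; // seconds
  usage_limit: number;
  auto_groups: string[];
  enrolled: boolean;
}
interface GroupSpec {
  peers: string[];
}
interface PolicySpec {
  description: string;
  enabled: boolean;
  sources: string[];
  destinations: string[];
  bidirectional: boolean;
  protocol: string;
}
interface DesiredState {
  groups: Record<string, GroupSpec>;
  setup_keys: Record<string, SetupKeySpec>;
  policies: Record<string, PolicySpec>;
}

// Returns human-readable problems; an empty array means every reference resolves.
function findDanglingReferences(state: DesiredState): string[] {
  const problems: string[] = [];
  const groupNames = Object.keys(state.groups);
  const keyNames = Object.keys(state.setup_keys);

  // Groups reference peers by setup key name.
  for (const groupName of groupNames) {
    for (const peer of state.groups[groupName].peers) {
      if (keyNames.indexOf(peer) === -1) {
        problems.push(`group "${groupName}" references unknown peer "${peer}"`);
      }
    }
  }
  // Setup keys reference groups through auto_groups.
  for (const keyName of keyNames) {
    for (const group of state.setup_keys[keyName].auto_groups) {
      if (groupNames.indexOf(group) === -1) {
        problems.push(`setup key "${keyName}" references unknown group "${group}"`);
      }
    }
  }
  // Policies reference groups by name.
  for (const policyName of Object.keys(state.policies)) {
    const policy = state.policies[policyName];
    for (const group of policy.sources.concat(policy.destinations)) {
      if (groupNames.indexOf(group) === -1) {
        problems.push(`policy "${policyName}" references unknown group "${group}"`);
      }
    }
  }
  return problems;
}
```

A check like this fits naturally before any diff is computed, since it needs only the file contents and no NetBird API calls.
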
## API Endpoints

All endpoints except `GET /health` are authenticated via `Authorization: Bearer <token>`.

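A bearer-token check of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the service's actual middleware: `isAuthorized` and `constantTimeEquals` are hypothetical names, and the constant-time comparison is an added hardening choice that the design does not mandate.

```typescript
// Compare two strings without returning early at the first mismatch, so the
// time taken does not reveal how many leading characters matched.
function constantTimeEquals(a: string, b: string): boolean {
  let mismatch = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt past the end yields NaN, which bitwise operators treat as 0.
    mismatch |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return mismatch === 0;
}

// Validate the raw Authorization header value against the configured token.
function isAuthorized(authorizationHeader: string | null, expectedToken: string): boolean {
  if (authorizationHeader === null || expectedToken.length === 0) return false;
  const prefix = "Bearer ";
  if (authorizationHeader.slice(0, prefix.length) !== prefix) return false;
  return constantTimeEquals(authorizationHeader.slice(prefix.length), expectedToken);
}
```

Rejecting an empty expected token guards against a misconfigured deployment accepting `Bearer ` with no credential.
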
### `POST /reconcile`

**Query params:** `dry_run=true|false` (default: false)
**Request body:** Contents of `netbird.json`

Behavior:
1. Fetch actual state from NetBird API (groups, setup keys, peers, policies, routes, DNS)
2. Process pending enrollments from event poller state
3. Compute diff between desired and actual
4. If `dry_run=true`: return plan without applying
5. If `dry_run=false`: execute in dependency order: groups, setup keys, peers, policies, routes. Abort on first failure.

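Step 3 can be illustrated with a name-keyed diff for a single resource kind. This is a minimal sketch, not the reconciler's implementation: `diffByName` and `PlanOperation` are hypothetical names, the `update_*` and `delete_*` operation types are assumed by analogy with the `create_*` types, and comparing with `JSON.stringify` is a simplification that is sensitive to key order.

```typescript
interface PlanOperation {
  type: string; // e.g. "create_group"
  name: string;
}

// Diff one resource kind (e.g. "group" or "policy") between desired and actual
// state, both keyed by resource name as in netbird.json.
function diffByName<T>(
  kind: string,
  desired: Record<string, T>,
  actual: Record<string, T>,
): PlanOperation[] {
  const has = (obj: Record<string, T>, key: string): boolean =>
    Object.prototype.hasOwnProperty.call(obj, key);
  const ops: PlanOperation[] = [];

  // Names are sorted so the plan is deterministic across runs.
  for (const resourceName of Object.keys(desired).sort()) {
    if (!has(actual, resourceName)) {
      ops.push({ type: `create_${kind}`, name: resourceName });
    } else if (JSON.stringify(desired[resourceName]) !== JSON.stringify(actual[resourceName])) {
      ops.push({ type: `update_${kind}`, name: resourceName });
    }
  }
  for (const resourceName of Object.keys(actual).sort()) {
    if (!has(desired, resourceName)) {
      ops.push({ type: `delete_${kind}`, name: resourceName });
    }
  }
  return ops;
}
```

For a dry run, the concatenated result of one such call per resource kind would be the plan returned to CI; for an apply, it would be the input to the execution phase.
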
Response:
```json
{
  "status": "applied | planned | error",
  "operations": [
    { "type": "create_group", "name": "pilots", "status": "success" },
    { "type": "create_setup_key", "name": "Pilot-hawk-72", "status": "success" },
    { "type": "create_policy", "name": "pilots-to-gs", "status": "failed", "error": "..." }
  ],
  "created_keys": {
    "Pilot-hawk-72": "XXXXXX-XXXXXX-XXXXXX"
  },
  "summary": { "created": 2, "updated": 0, "deleted": 0, "failed": 1 }
}
```

`created_keys` contains only the keys created in this run. CI uses it to build the encrypted artifacts.

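The `summary` object can be derived from the `operations` list. The sketch below assumes that a failed operation is counted only under `failed`, and that operation types always start with `create_`, `update_`, or `delete_`; `summarize`, `OperationResult`, and `RunSummary` are hypothetical names.

```typescript
interface OperationResult {
  type: string;
  name: string;
  status: "success" | "failed";
  error?: string;
}
interface RunSummary {
  created: number;
  updated: number;
  deleted: number;
  failed: number;
}

// Tally operations by outcome. Successful operations are bucketed by the verb
// prefix of their type; failed ones are counted once under `failed`.
function summarize(operations: OperationResult[]): RunSummary {
  const summary: RunSummary = { created: 0, updated: 0, deleted: 0, failed: 0 };
  for (const op of operations) {
    if (op.status === "failed") {
      summary.failed += 1;
    } else if (op.type.indexOf("create_") === 0) {
      summary.created += 1;
    } else if (op.type.indexOf("update_") === 0) {
      summary.updated += 1;
    } else if (op.type.indexOf("delete_") === 0) {
      summary.deleted += 1;
    }
  }
  return summary;
}
```
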
### `POST /sync-events`

Forces the event poller to process pending events immediately. Returns detected enrollments.

```json
{
  "enrollments": [
    { "setup_key_name": "GS-hawk-72", "peer_id": "abc123", "renamed": true, "groups_assigned": true }
  ]
}
```

### `GET /health`

No auth. Returns service status for Docker healthcheck.

## Event Poller

**Mechanism:**
- Polls `GET /api/events` every 30 seconds (configurable via `POLL_INTERVAL_SECONDS`)
- Persists `last_event_timestamp` to `/data/poller-state.json` (Docker volume)
- Loads last-known `netbird.json` desired state on startup and after each reconcile

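The `last_event_timestamp` cursor could be applied to each poll result roughly as follows. This is a sketch under assumptions: the event field names (`id`, `timestamp`, `activity_code`) are assumed rather than taken from this document, and `selectNewEvents` and `PollerState` are hypothetical names for the in-memory form of `/data/poller-state.json`.

```typescript
interface ActivityEvent {
  id: string;
  timestamp: string; // ISO 8601, assumed
  activity_code: string;
}
interface PollerState {
  last_event_timestamp: string | null; // null before the first successful poll
}

// Keep only events newer than the cursor, oldest first, and compute the
// cursor value to persist after they have been processed.
function selectNewEvents(
  events: ActivityEvent[],
  state: PollerState,
): { fresh: ActivityEvent[]; next: PollerState } {
  const cutoff =
    state.last_event_timestamp === null ? -Infinity : Date.parse(state.last_event_timestamp);
  const fresh = events
    .filter((e) => Date.parse(e.timestamp) > cutoff)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
  const last =
    fresh.length > 0 ? fresh[fresh.length - 1].timestamp : state.last_event_timestamp;
  return { fresh, next: { last_event_timestamp: last } };
}
```

The strict `>` comparison can miss an event that shares its timestamp with the cursor. Using `>=` instead would re-deliver the boundary event, which is acceptable only if processing is idempotent.
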
**Enrollment detection:**
1. Filter events for `peer.setupkey.add` activity
2. Extract `setup_key_name` from event metadata
3. Look up in desired state. If found and `enrolled: false`:
   - Rename peer to match setup key name via `PUT /api/peers/{id}`
   - Assign peer to groups from `setup_keys[name].auto_groups`
   - Commit `enrolled: true` to git via Gitea API (optimistic concurrency with SHA check)
   - Commit message: `chore: mark {key_name} as enrolled [automated]`
4. If not found: log warning (unknown peer enrolled outside GitOps)

**Edge cases:**
- Race with reconcile: if reconcile is in progress, enrollment processing queues until complete
- Duplicate events: idempotent, skip if peer already renamed and enrolled
- Unknown peers: logged but not touched

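The detection steps and edge cases above can be expressed as a pure decision function that the poller then acts on. This is a sketch, not the actual poller: the event shape (`activity_code`, `target_id`, `meta.setup_key_name`) and the use of `target_id` as the peer ID are assumptions, and `decideEnrollment` is a hypothetical name.

```typescript
interface EnrollmentEvent {
  activity_code: string;
  target_id: string; // assumed to carry the peer ID
  meta: { setup_key_name?: string };
}
interface KeyState {
  auto_groups: string[];
  enrolled: boolean;
}
type EnrollmentDecision =
  | { kind: "enroll"; peerId: string; newName: string; groups: string[]; commitMessage: string }
  | { kind: "skip"; reason: string };

// Decide what to do with one event, given the setup_keys section of the desired state.
function decideEnrollment(
  ev: EnrollmentEvent,
  setupKeys: Record<string, KeyState>,
): EnrollmentDecision {
  if (ev.activity_code !== "peer.setupkey.add") {
    return { kind: "skip", reason: "not an enrollment event" };
  }
  const keyName = ev.meta.setup_key_name;
  if (keyName === undefined || !Object.prototype.hasOwnProperty.call(setupKeys, keyName)) {
    // Unknown peers are logged by the caller but never modified.
    return { kind: "skip", reason: "unknown peer enrolled outside GitOps" };
  }
  const key = setupKeys[keyName];
  if (key.enrolled) {
    // Duplicate event: the peer was already renamed and marked enrolled.
    return { kind: "skip", reason: "already enrolled" };
  }
  return {
    kind: "enroll",
    peerId: ev.target_id,
    newName: keyName,
    groups: key.auto_groups,
    commitMessage: `chore: mark ${keyName} as enrolled [automated]`,
  };
}
```

Keeping the decision free of I/O makes the duplicate and unknown-peer cases easy to test without a NetBird or Gitea instance.
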
## CI Workflows

### `dry-run.yml`: On PR open/update

```yaml
on:
  pull_request:
    paths: ['netbird.json']
```

Steps:
1. Checkout PR branch
2. `POST /reconcile?dry_run=true` with `netbird.json`
3. Format response as markdown table
4. Post/update PR comment via Gitea API

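Step 3 could be handled by a small formatter like the one below. It is a sketch only: the column layout and the no-change message are illustrative choices, and `formatPlanComment` and `PlannedOperation` are hypothetical names.

```typescript
interface PlannedOperation {
  type: string; // e.g. "create_group"
  name: string;
}

// Render the planned operations as a markdown table suitable for a PR comment.
function formatPlanComment(operations: PlannedOperation[]): string {
  if (operations.length === 0) {
    return "No changes. NetBird already matches `netbird.json`.";
  }
  const lines = ["| Operation | Resource |", "|-----------|----------|"];
  for (const op of operations) {
    // Escape pipes so a resource name cannot break the table layout.
    const safeName = op.name.replace(/\|/g, "\\|");
    lines.push(`| \`${op.type}\` | ${safeName} |`);
  }
  return lines.join("\n");
}
```
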
### `reconcile.yml`: On push to main

```yaml
on:
  push:
    branches: [main]
    paths: ['netbird.json']
```

Steps:
1. Checkout repo
2. `POST /sync-events` to process pending enrollments
3. `POST /reconcile` with `netbird.json`
4. If `created_keys` non-empty: encrypt with `age`, upload as Gitea Actions artifact
5. Pull latest (poller may have committed)
6. On failure: job fails, engineer investigates

### Gitea Secrets

| Secret | Purpose |
|--------|---------|
| `RECONCILER_URL` | Reconciler service URL |
| `RECONCILER_TOKEN` | Bearer token for CI auth |
| `AGE_PUBLIC_KEY` | Encrypts setup key artifacts |
| `GITEA_TOKEN` | PR comment posting (achilles-ci-bot) |

## Deployment

Docker Compose on the NetBird VPS:

```yaml
services:
  netbird-reconciler:
    image: gitea.internal/blastpilot/netbird-reconciler:latest
    restart: unless-stopped
    environment:
      NETBIRD_API_URL: "https://netbird.example.com/api"
      NETBIRD_API_TOKEN: "${NETBIRD_API_TOKEN}"
      GITEA_URL: "https://gitea.example.com"
      GITEA_TOKEN: "${GITEA_TOKEN}"
      GITEA_REPO: "BlastPilot/netbird-gitops"
      RECONCILER_TOKEN: "${RECONCILER_TOKEN}"
      POLL_INTERVAL_SECONDS: "30"
      PORT: "8080"
    volumes:
      # Holds /data/poller-state.json across restarts
      - reconciler-data:/data
    healthcheck:
      # Requires curl inside the image; a distroless base does not ship it
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.reconciler.rule=Host(`reconciler.internal`)"

# Named volumes must be declared at the top level
volumes:
  reconciler-data:
```

### Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `NETBIRD_API_URL` | yes | NetBird management API base URL |
| `NETBIRD_API_TOKEN` | yes | NetBird API token |
| `GITEA_URL` | yes | Gitea instance URL |
| `GITEA_TOKEN` | yes | Gitea API token for commits |
| `GITEA_REPO` | yes | `owner/repo` for netbird-gitops |
| `RECONCILER_TOKEN` | yes | Bearer token for CI auth |
| `POLL_INTERVAL_SECONDS` | no | Poll interval (default: 30) |
| `PORT` | no | Listen port (default: 8080) |

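The table above maps onto a startup-time config loader. The following is a minimal sketch: `loadConfig` and `ReconcilerConfig` are hypothetical names, and failing fast on a missing or malformed variable is an assumed design choice.

```typescript
interface ReconcilerConfig {
  netbirdApiUrl: string;
  netbirdApiToken: string;
  giteaUrl: string;
  giteaToken: string;
  giteaRepo: string;
  reconcilerToken: string;
  pollIntervalSeconds: number;
  port: number;
}

// Build the config from an environment map, applying the documented defaults.
function loadConfig(env: Record<string, string | undefined>): ReconcilerConfig {
  const required = (key: string): string => {
    const value = env[key];
    if (value === undefined || value === "") {
      throw new Error(`missing required environment variable ${key}`);
    }
    return value;
  };
  const optionalInt = (key: string, fallback: number): number => {
    const raw = env[key];
    if (raw === undefined || raw === "") return fallback;
    const parsed = Number(raw);
    if (!Number.isInteger(parsed) || parsed <= 0) {
      throw new Error(`${key} must be a positive integer, got "${raw}"`);
    }
    return parsed;
  };
  return {
    netbirdApiUrl: required("NETBIRD_API_URL"),
    netbirdApiToken: required("NETBIRD_API_TOKEN"),
    giteaUrl: required("GITEA_URL"),
    giteaToken: required("GITEA_TOKEN"),
    giteaRepo: required("GITEA_REPO"),
    reconcilerToken: required("RECONCILER_TOKEN"),
    pollIntervalSeconds: optionalInt("POLL_INTERVAL_SECONDS", 30),
    port: optionalInt("PORT", 8080),
  };
}
```

Under Deno the function could be called as `loadConfig(Deno.env.toObject())`; taking the map as a parameter keeps it testable without touching the real environment.
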
### Container Image Build

Tag-triggered CI (`v*`) in netbird-gitops:
1. `deno compile` to single binary
2. Docker build (`FROM denoland/deno:distroless`)
3. Push to Gitea container registry

## Error Handling & Rollback

**Validation phase (before mutations):**
- Parse and validate `netbird.json` schema
- Fetch all actual state
- Compute diff and verify all operations are possible
- If validation fails: return error, no mutations

**Apply phase:**
- Execute in dependency order (groups -> keys -> peers -> policies -> routes)
- On any failure: abort immediately, return partial results
- No automatic rollback; git revert is the rollback mechanism

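The apply phase can be sketched as an ordered, fail-fast loop. This is an illustration under assumptions, not the reconciler's code: `applyInOrder`, `dependencyRank`, and the operation types are hypothetical, and `execute` is synchronous here for brevity where a real implementation would await NetBird API calls.

```typescript
// Resource kinds in the order the apply phase processes them.
const DEPENDENCY_ORDER = ["group", "setup_key", "peer", "policy", "route"];

interface PendingOperation {
  type: string; // e.g. "create_group"
  name: string;
}
interface AppliedOperation extends PendingOperation {
  status: "success" | "failed";
  error?: string;
}

// Rank an operation by the resource part of its type ("create_group" -> "group").
function dependencyRank(op: PendingOperation): number {
  const resource = op.type.replace(/^(create|update|delete)_/, "");
  const rank = DEPENDENCY_ORDER.indexOf(resource);
  return rank === -1 ? DEPENDENCY_ORDER.length : rank; // unknown kinds go last
}

// Execute operations in dependency order and stop at the first failure.
// Operations after the failure are never attempted and so do not appear in the result.
function applyInOrder(
  ops: PendingOperation[],
  execute: (op: PendingOperation) => void,
): AppliedOperation[] {
  const ordered = ops
    .map((op, index) => ({ op, index }))
    .sort((a, b) => dependencyRank(a.op) - dependencyRank(b.op) || a.index - b.index)
    .map((entry) => entry.op);

  const results: AppliedOperation[] = [];
  for (const op of ordered) {
    try {
      execute(op);
      results.push({ type: op.type, name: op.name, status: "success" });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ type: op.type, name: op.name, status: "failed", error: message });
      break; // abort immediately; no rollback is attempted
    }
  }
  return results;
}
```

The sketch applies the same order to every verb. Deletions may need the reverse order (for example, removing a policy before the group it references); the design does not specify this.
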
**Why no automatic rollback:**
- Partial rollback is harder to get right than partial apply
- Git history provides clear, auditable rollback path
- `git revert` + re-reconcile converges to correct state
- Reconciler is idempotent: running twice with same state is safe

**Recovery pattern:**
1. Reconcile fails mid-apply
2. CI job fails, engineer notified
3. Engineer either forward-fixes `netbird.json` or `git revert`s the merge commit
4. New push triggers reconcile, converging to correct state

**Logging:**
- Structured JSON logs
- Every NetBird API call logged (method, path, status)
- Every state mutation logged (before/after)
- Event poller logs each event processed

## Resources Managed

| Resource | NetBird API | Create | Update | Delete |
|----------|-------------|--------|--------|--------|
| Groups | `/api/groups` | yes | yes (peers) | yes |
| Setup Keys | `/api/setup-keys` | yes | no (immutable) | yes |
| Peers | `/api/peers` | no (self-enroll) | yes (rename, groups) | yes |
| Policies | `/api/policies` | yes | yes | yes |
| Routes | `/api/routes` | yes | yes | yes |
| DNS | `/api/dns/nameservers` | yes | yes | yes |

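The table above can be encoded as a lookup that the validation phase consults when checking that a planned operation is possible. This is a sketch with hypothetical names (`CAPABILITIES`, `isSupported`); only the yes/no values come from the table.

```typescript
type ResourceKind = "groups" | "setup_keys" | "peers" | "policies" | "routes" | "dns";
type Verb = "create" | "update" | "delete";

// One row per resource, mirroring the Create / Update / Delete columns.
const CAPABILITIES: Record<ResourceKind, Record<Verb, boolean>> = {
  groups: { create: true, update: true, delete: true },
  setup_keys: { create: true, update: false, delete: true }, // immutable once created
  peers: { create: false, update: true, delete: true }, // peers self-enroll
  policies: { create: true, update: true, delete: true },
  routes: { create: true, update: true, delete: true },
  dns: { create: true, update: true, delete: true },
};

// True when the reconciler is allowed to perform `verb` on `kind`.
function isSupported(kind: ResourceKind, verb: Verb): boolean {
  return CAPABILITIES[kind][verb];
}
```
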