Initial commit

Prox 2026-03-03 23:45:05 +02:00
commit 2224d21c0e
2 changed files with 3305 additions and 0 deletions

@@ -0,0 +1,309 @@
# NetBird Reconciler — Design Document
> **Status:** Approved
> **Author:** @prox
> **Date:** 2026-03-03
> **Proposal:** NetBird GitOps Proposal (rev2)
## Overview
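A dedicated backend service that provides declarative, GitOps-driven reconciliation for NetBird VPN configuration. Engineers declare desired state in `netbird.json`; the reconciler computes a diff, validates the whole plan before mutating anything, and then applies changes in dependency order, aborting on the first failure. A failed apply can leave partial changes in place, so rollback is done through git rather than automatically.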
**Repo:** `BlastPilot/netbird-gitops` (service code + state file in one repo)
**Runtime:** TypeScript / Deno
**Deployment:** Docker Compose on the NetBird VPS, behind Traefik
## Architecture
The reconciler has two responsibilities:
1. **Reconciliation API** — Called by Gitea Actions CI on PR events. Accepts desired state (`netbird.json`), fetches actual state from NetBird API, computes a diff, and either returns a plan (dry-run) or applies changes.
2. **Event Poller** — Background loop polling NetBird `/api/events` every 30s to detect peer enrollments. When a peer enrolls via a known setup key, the poller renames it, assigns it to the correct group, and commits `enrolled: true` back to git via Gitea API.
### Data Flow
```
Engineer -> PR to netbird-gitops (edit netbird.json)
  -> CI: dry-run -> reconciler -> plan posted as PR comment
  -> PR merged -> CI: apply -> reconciler -> mutations to NetBird API
                                          -> response with created_keys
  -> CI: encrypt keys with age, upload artifact

Event poller (background):
  -> polls NetBird /api/events
  -> detects peer enrollment (peer.setupkey.add)
  -> renames peer, assigns groups
  -> commits enrolled:true via Gitea API
```
### Integration with Enrollment Pipeline
The existing enrollment pipeline in `blastpilot-public` changes:
- **Before:** `handleApproval()` creates `peers/enrollment-{N}.json`, `handlePRMerge()` calls NetBird API directly to create setup keys, emails PDF.
- **After:** `handleApproval()` modifies `netbird.json` (adds setup key + group entries) and creates PR. Key creation is handled by the reconciler on merge. Key delivery starts as manual (engineer downloads encrypted artifact), with automation added later.
## State File Format
`netbird.json` at repo root. All resources referenced by name, never by NetBird ID.
```json
{
  "groups": {
    "pilots": { "peers": ["Pilot-hawk-72"] },
    "ground-stations": { "peers": ["GS-hawk-72"] },
    "commanders": { "peers": [] }
  },
  "setup_keys": {
    "GS-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["ground-stations"],
      "enrolled": true
    },
    "Pilot-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["pilots"],
      "enrolled": false
    }
  },
  "policies": {
    "pilots-to-gs": {
      "description": "Allow pilots to reach ground stations",
      "enabled": true,
      "sources": ["pilots"],
      "destinations": ["ground-stations"],
      "bidirectional": true,
      "protocol": "ALL"
    }
  },
  "routes": {
    "gs-local-network": {
      "description": "Route to GS local subnet",
      "network": "192.168.1.0/24",
      "peer_groups": ["ground-stations"],
      "enabled": true
    }
  },
  "dns": {
    "nameserver_groups": {}
  }
}
```
**Conventions:**
- Setup key name = expected peer hostname
- `enrolled: false` — setup key should exist, peer hasn't connected yet
- `enrolled: true` — peer detected, renamed, assigned to groups
- Groups reference peers by setup key name (becomes peer hostname after rename)
- Policies reference groups by name
- Reconciler maintains internal name-to-ID mapping fetched at plan time
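
The conventions above imply that every name in the file must resolve to another entry in the same file. A minimal sketch of such a check follows. The interfaces mirror the example state file, but `findDanglingReferences` and its rules are assumptions for illustration, not a confirmed part of the reconciler. In particular, it assumes every peer listed in a group has a matching setup key entry.

```typescript
// Field names mirror the netbird.json example; the validation rules are an
// assumption drawn from the conventions list, not a confirmed schema.
interface SetupKey {
  type: string;
  expires_in: number;
  usage_limit: number;
  auto_groups: string[];
  enrolled: boolean;
}

interface Policy {
  enabled: boolean;
  sources: string[];
  destinations: string[];
}

interface Route {
  network: string;
  peer_groups: string[];
  enabled: boolean;
}

interface DesiredState {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, SetupKey>;
  policies: Record<string, Policy>;
  routes: Record<string, Route>;
}

// Hypothetical helper: returns one message per name that does not resolve.
function findDanglingReferences(state: DesiredState): string[] {
  const problems: string[] = [];
  const groupNames = new Set(Object.keys(state.groups));
  const keyNames = new Set(Object.keys(state.setup_keys));

  const checkGroups = (owner: string, refs: string[]): void => {
    for (const ref of refs) {
      if (!groupNames.has(ref)) {
        problems.push(`${owner} references unknown group "${ref}"`);
      }
    }
  };

  // Groups list peers by setup key name (assumed to be the only valid source).
  for (const [name, group] of Object.entries(state.groups)) {
    for (const peer of group.peers) {
      if (!keyNames.has(peer)) {
        problems.push(`group "${name}" references unknown peer "${peer}"`);
      }
    }
  }
  for (const [name, key] of Object.entries(state.setup_keys)) {
    checkGroups(`setup key "${name}"`, key.auto_groups);
  }
  for (const [name, policy] of Object.entries(state.policies)) {
    checkGroups(`policy "${name}"`, [...policy.sources, ...policy.destinations]);
  }
  for (const [name, route] of Object.entries(state.routes)) {
    checkGroups(`route "${name}"`, route.peer_groups);
  }
  return problems;
}
```

A check of this kind would fit in the validation phase, so that a typo in a group name is rejected before any NetBird API call is made.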
## API Endpoints
All endpoints except `GET /health` are authenticated via `Authorization: Bearer <token>`.
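
A bearer check of this kind could look like the sketch below. `isAuthorized` and `timingSafeEqual` are hypothetical helper names, and the constant-time comparison is an added precaution rather than something this document specifies.

```typescript
// Hypothetical helper: compares two strings without returning early, so the
// time taken does not reveal how many leading characters matched.
function timingSafeEqual(a: string, b: string): boolean {
  // Fold the length difference into the result instead of short-circuiting.
  let diff = a.length ^ b.length;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i % Math.max(b.length, 1));
  }
  return diff === 0;
}

// Hypothetical helper: validates the raw Authorization header value against
// the configured RECONCILER_TOKEN.
function isAuthorized(header: string | null, expectedToken: string): boolean {
  // An empty configured token must never authorize anyone.
  if (header === null || expectedToken.length === 0) return false;
  const prefix = "Bearer ";
  if (!header.startsWith(prefix)) return false;
  return timingSafeEqual(header.slice(prefix.length), expectedToken);
}
```

A request handler would pass in `request.headers.get("Authorization")` and respond with 401 when the function returns `false`.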
### `POST /reconcile`
**Query params:** `dry_run=true|false` (default: false)
**Request body:** Contents of `netbird.json`
Behavior:
1. Fetch actual state from NetBird API (groups, setup keys, peers, policies, routes, DNS)
2. Process pending enrollments from event poller state
3. Compute diff between desired and actual
4. If `dry_run=true`: return plan without applying
5. If `dry_run=false`: execute in dependency order — groups, setup keys, peers, policies, routes. Abort on first failure.
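
The diff step can be sketched as a name-based set comparison per resource kind. This is a simplified illustration under several assumptions: `planOperations`, the `delete_*` operation names, and running deletions in reverse dependency order are not taken from this document, and update detection (field-by-field comparison) and peers (which self-enroll) are left out.

```typescript
type Kind = "group" | "setup_key" | "policy" | "route";

interface Operation {
  type: string;
  name: string;
}

type NamesByKind = Record<Kind, string[]>;

// Creation order from the behavior list: groups, setup keys, policies, routes.
const DEPENDENCY_ORDER: Kind[] = ["group", "setup_key", "policy", "route"];

// Hypothetical planner: compares resource names only.
function planOperations(desired: NamesByKind, actual: NamesByKind): Operation[] {
  const creates: Operation[] = [];
  const deletes: Operation[] = [];
  for (const kind of DEPENDENCY_ORDER) {
    const have = new Set(actual[kind]);
    const want = new Set(desired[kind]);
    for (const name of desired[kind]) {
      if (!have.has(name)) creates.push({ type: `create_${kind}`, name });
    }
    for (const name of actual[kind]) {
      // unshift puts dependents (routes, policies) ahead of the groups they use.
      if (!want.has(name)) deletes.unshift({ type: `delete_${kind}`, name });
    }
  }
  // Assumption: all creations run before any deletion.
  return [...creates, ...deletes];
}
```

With `dry_run=true` the returned list would be the plan itself; with `dry_run=false` it would be handed to the apply step in the same order.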
Response:
```json
{
  "status": "applied | planned | error",
  "operations": [
    { "type": "create_group", "name": "pilots", "status": "success" },
    { "type": "create_setup_key", "name": "Pilot-hawk-72", "status": "success" },
    { "type": "create_policy", "name": "pilots-to-gs", "status": "failed", "error": "..." }
  ],
  "created_keys": {
    "Pilot-hawk-72": "XXXXXX-XXXXXX-XXXXXX"
  },
  "summary": { "created": 2, "updated": 0, "deleted": 0, "failed": 1 }
}
```
`created_keys` only contains keys created in this run. CI uses this for encrypted artifacts.
### `POST /sync-events`
Forces the event poller to process pending events immediately. Returns detected enrollments.
```json
{
  "enrollments": [
    { "setup_key_name": "GS-hawk-72", "peer_id": "abc123", "renamed": true, "groups_assigned": true }
  ]
}
```
### `GET /health`
No auth. Returns service status for Docker healthcheck.
## Event Poller
**Mechanism:**
- Polls `GET /api/events` every 30 seconds (configurable via `POLL_INTERVAL_SECONDS`)
- Persists `last_event_timestamp` to `/data/poller-state.json` (Docker volume)
- Loads last-known `netbird.json` desired state on startup and after each reconcile
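
The persisted timestamp acts as a cursor over the event stream. The sketch below shows one way to select unseen events and advance the cursor. `takeNewEvents` is a hypothetical helper, and it assumes each event carries an ISO 8601 `timestamp` string. The strict comparison means two events sharing the exact cursor timestamp could be missed; a real implementation might also track event IDs.

```typescript
// Assumed minimal event shape: only the timestamp matters for the cursor.
interface TimedEvent {
  timestamp: string; // ISO 8601
}

interface PollResult<T> {
  fresh: T[]; // unseen events, oldest first
  cursor: string | null; // value to persist as last_event_timestamp
}

// Hypothetical helper: keeps events strictly newer than the stored cursor.
function takeNewEvents<T extends TimedEvent>(
  events: T[],
  lastSeen: string | null,
): PollResult<T> {
  const since = lastSeen === null ? Number.NEGATIVE_INFINITY : Date.parse(lastSeen);
  const fresh = events
    .filter((e) => Date.parse(e.timestamp) > since)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));

  // The cursor only moves forward; with no new events it stays unchanged.
  let cursor = lastSeen;
  for (const e of fresh) cursor = e.timestamp;
  return { fresh, cursor };
}
```

Writing the cursor to `/data/poller-state.json` only after the events were processed successfully would let a crashed poller re-read them on restart, which the idempotent enrollment handling tolerates.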
**Enrollment detection:**
1. Filter events for `peer.setupkey.add` activity
2. Extract `setup_key_name` from event metadata
3. Look up in desired state — if found and `enrolled: false`:
   - Rename peer to match setup key name via `PUT /api/peers/{id}`
   - Assign peer to groups from `setup_keys[name].auto_groups`
   - Commit `enrolled: true` to git via Gitea API (optimistic concurrency with SHA check)
   - Commit message: `chore: mark {key_name} as enrolled [automated]`
4. If not found: log warning (unknown peer enrolled outside GitOps)
**Edge cases:**
- Race with reconcile: if reconcile is in progress, enrollment processing queues until complete
- Duplicate events: idempotent — skip if peer already renamed and enrolled
- Unknown peers: logged but not touched
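
The detection rules and edge cases above can be expressed as a pure planning function that decides what to do for each event without performing any side effects. This is a sketch under assumptions: the flattened `EnrollmentEvent` shape and the `planEnrollments` name are hypothetical and do not reflect the actual NetBird event schema.

```typescript
// Assumed, simplified event shape (not the real NetBird payload).
interface EnrollmentEvent {
  activity: string;
  peerId: string;
  setupKeyName: string;
}

interface KeyState {
  auto_groups: string[];
  enrolled: boolean;
}

type EnrollmentAction =
  | { kind: "enroll"; peerId: string; newName: string; groups: string[] }
  | { kind: "warn"; peerId: string; setupKeyName: string };

// Hypothetical planner: maps raw events to the actions the poller should take.
function planEnrollments(
  events: EnrollmentEvent[],
  keys: Record<string, KeyState>,
): EnrollmentAction[] {
  const handled = new Set<string>();
  const actions: EnrollmentAction[] = [];
  for (const ev of events) {
    if (ev.activity !== "peer.setupkey.add") continue;

    // Unknown key: the peer enrolled outside GitOps, so only warn.
    if (!Object.prototype.hasOwnProperty.call(keys, ev.setupKeyName)) {
      actions.push({ kind: "warn", peerId: ev.peerId, setupKeyName: ev.setupKeyName });
      continue;
    }

    const key = keys[ev.setupKeyName];
    // Idempotency: skip keys already enrolled or already seen in this batch.
    if (key.enrolled || handled.has(ev.setupKeyName)) continue;
    handled.add(ev.setupKeyName);
    actions.push({
      kind: "enroll",
      peerId: ev.peerId,
      newName: ev.setupKeyName,
      groups: [...key.auto_groups],
    });
  }
  return actions;
}
```

Keeping the decision logic separate from the NetBird and Gitea calls makes the duplicate-event and unknown-peer cases easy to unit test.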
## CI Workflows
### `dry-run.yml` — On PR open/update
```yaml
on:
  pull_request:
    paths: ['netbird.json']
```
Steps:
1. Checkout PR branch
2. `POST /reconcile?dry_run=true` with `netbird.json`
3. Format response as markdown table
4. Post/update PR comment via Gitea API
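
Step 3 could be done with a small formatter like the one below. `formatPlanAsMarkdown` is a hypothetical name and the column layout is an assumption; the input shape follows the `operations` array of the `/reconcile` response.

```typescript
// Shape of one entry in the `operations` array of the /reconcile response.
interface PlannedOperation {
  type: string;
  name: string;
  status?: string;
  error?: string;
}

// Pipes and newlines would break a Markdown table row, so neutralize them.
function escapeCell(text: string): string {
  return text.replace(/\|/g, "\\|").replace(/\r?\n/g, " ");
}

// Hypothetical formatter for the PR comment body.
function formatPlanAsMarkdown(ops: PlannedOperation[]): string {
  if (ops.length === 0) return "No changes.";
  const rows = ops.map(
    (op) =>
      `| ${escapeCell(op.type)} | ${escapeCell(op.name)} | ${escapeCell(op.status ?? "planned")} |`,
  );
  return ["| Operation | Name | Status |", "|---|---|---|", ...rows].join("\n");
}
```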
### `reconcile.yml` — On push to main
```yaml
on:
  push:
    branches: [main]
    paths: ['netbird.json']
```
Steps:
1. Checkout repo
2. `POST /sync-events` — process pending enrollments
3. `POST /reconcile` with `netbird.json`
4. If `created_keys` non-empty: encrypt with `age`, upload as Gitea Actions artifact
5. Pull latest (poller may have committed)
6. On failure: job fails, engineer investigates
### Gitea Secrets
| Secret | Purpose |
|--------|---------|
| `RECONCILER_URL` | Reconciler service URL |
| `RECONCILER_TOKEN` | Bearer token for CI auth |
| `AGE_PUBLIC_KEY` | Encrypts setup key artifacts |
| `GITEA_TOKEN` | PR comment posting (achilles-ci-bot) |
## Deployment
Docker Compose on the NetBird VPS:
```yaml
services:
  netbird-reconciler:
    image: gitea.internal/blastpilot/netbird-reconciler:latest
    restart: unless-stopped
    environment:
      NETBIRD_API_URL: "https://netbird.example.com/api"
      NETBIRD_API_TOKEN: "${NETBIRD_API_TOKEN}"
      GITEA_URL: "https://gitea.example.com"
      GITEA_TOKEN: "${GITEA_TOKEN}"
      GITEA_REPO: "BlastPilot/netbird-gitops"
      RECONCILER_TOKEN: "${RECONCILER_TOKEN}"
      POLL_INTERVAL_SECONDS: "30"
      PORT: "8080"
    volumes:
      - reconciler-data:/data
    healthcheck:
      # Assumes curl is present in the image; distroless base images
      # typically ship neither curl nor a shell.
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.reconciler.rule=Host(`reconciler.internal`)"

# Named volumes must be declared at the top level.
volumes:
  reconciler-data:
```
### Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| `NETBIRD_API_URL` | yes | NetBird management API base URL |
| `NETBIRD_API_TOKEN` | yes | NetBird API token |
| `GITEA_URL` | yes | Gitea instance URL |
| `GITEA_TOKEN` | yes | Gitea API token for commits |
| `GITEA_REPO` | yes | `owner/repo` for netbird-gitops |
| `RECONCILER_TOKEN` | yes | Bearer token for CI auth |
| `POLL_INTERVAL_SECONDS` | no | Poll interval (default: 30) |
| `PORT` | no | Listen port (default: 8080) |
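
The table above maps naturally onto a startup-time config loader that fails fast on missing values. The sketch below takes the environment as a plain record so it stays testable. `loadConfig` and the `Config` field names are hypothetical.

```typescript
interface Config {
  netbirdApiUrl: string;
  netbirdApiToken: string;
  giteaUrl: string;
  giteaToken: string;
  giteaRepo: string;
  reconcilerToken: string;
  pollIntervalSeconds: number;
  port: number;
}

// Hypothetical loader: required variables throw, optional ones fall back to
// the defaults listed in the table.
function loadConfig(env: Record<string, string | undefined>): Config {
  const required = (name: string): string => {
    const value = env[name];
    if (value === undefined || value === "") {
      throw new Error(`missing required environment variable ${name}`);
    }
    return value;
  };

  const positiveInt = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw === "") return fallback;
    const parsed = Number(raw);
    if (!Number.isInteger(parsed) || parsed <= 0) {
      throw new Error(`${name} must be a positive integer, got "${raw}"`);
    }
    return parsed;
  };

  return {
    netbirdApiUrl: required("NETBIRD_API_URL"),
    netbirdApiToken: required("NETBIRD_API_TOKEN"),
    giteaUrl: required("GITEA_URL"),
    giteaToken: required("GITEA_TOKEN"),
    giteaRepo: required("GITEA_REPO"),
    reconcilerToken: required("RECONCILER_TOKEN"),
    pollIntervalSeconds: positiveInt("POLL_INTERVAL_SECONDS", 30),
    port: positiveInt("PORT", 8080),
  };
}
```

Under Deno the function could be called as `loadConfig(Deno.env.toObject())` at startup, so a misconfigured container exits immediately instead of failing on the first request.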
### Container Image Build
Tag-triggered CI (`v*`) in netbird-gitops:
1. `deno compile` to single binary
2. Docker build (`FROM denoland/deno:distroless`)
3. Push to Gitea container registry
## Error Handling & Rollback
**Validation phase (before mutations):**
- Parse and validate `netbird.json` schema
- Fetch all actual state
- Compute diff and verify all operations are possible
- If validation fails: return error, no mutations
**Apply phase:**
- Execute in dependency order (groups -> keys -> peers -> policies -> routes)
- On any failure: abort immediately, return partial results
- No automatic rollback — git revert is the rollback mechanism
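
The apply phase described above amounts to a sequential loop that stops at the first error and reports what it managed to do. A minimal sketch follows. `applyPlan` and the injected `execute` callback are hypothetical names, and the callback stands in for the real NetBird API calls.

```typescript
interface Operation {
  type: string;
  name: string;
}

interface OperationResult extends Operation {
  status: "success" | "failed";
  error?: string;
}

// Hypothetical apply loop: runs operations in the given (dependency) order,
// aborts on the first failure, and never attempts a rollback.
async function applyPlan(
  ops: Operation[],
  execute: (op: Operation) => Promise<void>,
): Promise<OperationResult[]> {
  const results: OperationResult[] = [];
  for (const op of ops) {
    try {
      await execute(op);
      results.push({ ...op, status: "success" });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ ...op, status: "failed", error: message });
      break; // remaining operations are intentionally not attempted
    }
  }
  return results;
}
```

Operations after the failed one do not appear in the result, which matches the partial-results response: the caller sees exactly which mutations reached NetBird before the abort.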
**Why no automatic rollback:**
- Partial rollback is harder to get right than partial apply
- Git history provides clear, auditable rollback path
- `git revert` + re-reconcile converges to correct state
- Reconciler is idempotent — running twice with same state is safe
**Recovery pattern:**
1. Reconcile fails mid-apply
2. CI job fails, engineer notified
3. Engineer either forward-fixes `netbird.json` or `git revert`s the merge commit
4. New push triggers reconcile, converging to correct state
**Logging:**
- Structured JSON logs
- Every NetBird API call logged (method, path, status)
- Every state mutation logged (before/after)
- Event poller logs each event processed
## Resources Managed
| Resource | NetBird API | Create | Update | Delete |
|----------|-------------|--------|--------|--------|
| Groups | `/api/groups` | yes | yes (peers) | yes |
| Setup Keys | `/api/setup-keys` | yes | no (immutable) | yes |
| Peers | `/api/peers` | no (self-enroll) | yes (rename, groups) | yes |
| Policies | `/api/policies` | yes | yes | yes |
| Routes | `/api/routes` | yes | yes | yes |
| DNS | `/api/dns/nameservers` | yes | yes | yes |
