Initial commit
commit 2224d21c0e

docs/plans/2026-03-03-netbird-reconciler-design.md (new file, 309 lines)

# NetBird Reconciler — Design Document

> **Status:** Approved
> **Author:** @prox
> **Date:** 2026-03-03
> **Proposal:** NetBird GitOps Proposal (rev2)

## Overview

A dedicated backend service that provides declarative, GitOps-driven reconciliation for NetBird VPN configuration. Engineers declare desired state in `netbird.json`; the reconciler computes a diff against the live NetBird state, validates the full plan before mutating anything, and then applies changes in dependency order, aborting on the first failure.

**Repo:** `BlastPilot/netbird-gitops` (service code + state file in one repo)
**Runtime:** TypeScript / Deno
**Deployment:** Docker Compose on the NetBird VPS, behind Traefik

## Architecture

The reconciler has two responsibilities:

1. **Reconciliation API:** Called by Gitea Actions CI on PR events. Accepts desired state (`netbird.json`), fetches actual state from the NetBird API, computes a diff, and either returns a plan (dry-run) or applies the changes.

2. **Event Poller:** A background loop that polls NetBird `/api/events` every 30 seconds to detect peer enrollments. When a peer enrolls via a known setup key, the poller renames it, assigns it to the correct groups, and commits `enrolled: true` back to git via the Gitea API.

### Data Flow

```
Engineer -> PR to netbird-gitops (edit netbird.json)
  -> CI: dry-run -> reconciler -> plan posted as PR comment
  -> PR merged -> CI: apply -> reconciler -> mutations to NetBird API
                                          -> response with created_keys
               -> CI: encrypt keys with age, upload artifact

Event poller (background):
  -> polls NetBird /api/events
  -> detects peer enrollment (peer.setupkey.add)
  -> renames peer, assigns groups
  -> commits enrolled:true via Gitea API
```

### Integration with Enrollment Pipeline

The existing enrollment pipeline in `blastpilot-public` changes as follows:

- **Before:** `handleApproval()` creates `peers/enrollment-{N}.json`; `handlePRMerge()` calls the NetBird API directly to create setup keys and emails a PDF.
- **After:** `handleApproval()` modifies `netbird.json` (adds setup key + group entries) and creates a PR. Key creation is handled by the reconciler on merge. Key delivery starts as manual (an engineer downloads the encrypted artifact), with automation added later.

## State File Format

`netbird.json` lives at the repo root. All resources are referenced by name, never by NetBird ID.

```json
{
  "groups": {
    "pilots": { "peers": ["Pilot-hawk-72"] },
    "ground-stations": { "peers": ["GS-hawk-72"] },
    "commanders": { "peers": [] }
  },
  "setup_keys": {
    "GS-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["ground-stations"],
      "enrolled": true
    },
    "Pilot-hawk-72": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["pilots"],
      "enrolled": false
    }
  },
  "policies": {
    "pilots-to-gs": {
      "description": "Allow pilots to reach ground stations",
      "enabled": true,
      "sources": ["pilots"],
      "destinations": ["ground-stations"],
      "bidirectional": true,
      "protocol": "ALL"
    }
  },
  "routes": {
    "gs-local-network": {
      "description": "Route to GS local subnet",
      "network": "192.168.1.0/24",
      "peer_groups": ["ground-stations"],
      "enabled": true
    }
  },
  "dns": {
    "nameserver_groups": {}
  }
}
```

**Conventions:**
- Setup key name = expected peer hostname
- `enrolled: false`: the setup key should exist, but the peer hasn't connected yet
- `enrolled: true`: the peer has been detected, renamed, and assigned to groups
- Groups reference peers by setup key name (which becomes the peer hostname after rename)
- Policies reference groups by name
- The reconciler maintains an internal name-to-ID mapping, fetched at plan time
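
Because everything is linked by name, a cheap pre-flight check can catch dangling references before any API call is made. The following is a minimal sketch under stated assumptions, not the reconciler's actual code: the field names are taken from the example above, while the type names and `findDanglingReferences` are hypothetical. It assumes every peer listed in a group has a matching setup key, and it leaves out routes and DNS for brevity.

```typescript
// Hypothetical types mirroring the netbird.json example above.
interface SetupKey {
  type: string;
  expires_in: number;
  usage_limit: number;
  auto_groups: string[];
  enrolled: boolean;
}

interface Policy {
  description: string;
  enabled: boolean;
  sources: string[];
  destinations: string[];
  bidirectional: boolean;
  protocol: string;
}

interface DesiredState {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, SetupKey>;
  policies: Record<string, Policy>;
}

// Returns one message per name that does not resolve; an empty array
// means every reference points at something declared in the file.
function findDanglingReferences(state: DesiredState): string[] {
  const errors: string[] = [];
  const groupNames = new Set(Object.keys(state.groups));
  const keyNames = new Set(Object.keys(state.setup_keys));

  // Assumption: groups reference peers by setup key name.
  for (const [group, { peers }] of Object.entries(state.groups)) {
    for (const peer of peers) {
      if (!keyNames.has(peer)) errors.push(`group "${group}": unknown peer "${peer}"`);
    }
  }
  for (const [key, { auto_groups }] of Object.entries(state.setup_keys)) {
    for (const g of auto_groups) {
      if (!groupNames.has(g)) errors.push(`setup key "${key}": unknown group "${g}"`);
    }
  }
  for (const [policy, { sources, destinations }] of Object.entries(state.policies)) {
    for (const g of [...sources, ...destinations]) {
      if (!groupNames.has(g)) errors.push(`policy "${policy}": unknown group "${g}"`);
    }
  }
  return errors;
}
```

A check like this fits the validation phase: a non-empty result would be returned as an error before any mutation is attempted.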

## API Endpoints

All endpoints except `GET /health` are authenticated via `Authorization: Bearer <token>`.
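
As an illustration of the bearer check, here is a small sketch. The function name `isAuthorized` is hypothetical, and the constant-time comparison is a suggested precaution rather than something this design specifies; the expected token would come from `RECONCILER_TOKEN`.

```typescript
// Hypothetical helper: true only for "Bearer <token>" carrying the expected token.
function isAuthorized(header: string | null, expectedToken: string): boolean {
  const prefix = "Bearer ";
  if (header === null || !header.startsWith(prefix)) return false;
  const presented = header.slice(prefix.length);
  if (presented.length !== expectedToken.length) return false;

  // Accumulate differences instead of returning at the first mismatch, so
  // the time taken does not reveal how many leading characters matched.
  let diff = 0;
  for (let i = 0; i < presented.length; i++) {
    diff |= presented.charCodeAt(i) ^ expectedToken.charCodeAt(i);
  }
  return diff === 0;
}
```

A request handler would call this with the incoming `Authorization` header and respond with 401 when it returns `false`.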

### `POST /reconcile`

**Query params:** `dry_run=true|false` (default: false)
**Request body:** Contents of `netbird.json`

Behavior:
1. Fetch actual state from NetBird API (groups, setup keys, peers, policies, routes, DNS)
2. Process pending enrollments from event poller state
3. Compute diff between desired and actual
4. If `dry_run=true`: return plan without applying
5. If `dry_run=false`: execute in dependency order (groups, setup keys, peers, policies, routes). Abort on first failure.

Response:
```json
{
  "status": "applied | planned | error",
  "operations": [
    { "type": "create_group", "name": "pilots", "status": "success" },
    { "type": "create_setup_key", "name": "Pilot-hawk-72", "status": "success" },
    { "type": "create_policy", "name": "pilots-to-gs", "status": "failed", "error": "..." }
  ],
  "created_keys": {
    "Pilot-hawk-72": "XXXXXX-XXXXXX-XXXXXX"
  },
  "summary": { "created": 2, "updated": 0, "deleted": 0, "failed": 1 }
}
```

`status` is one of `applied`, `planned`, or `error`. `created_keys` contains only the keys created in this run; CI uses it to build the encrypted artifact.
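
To make the diff step concrete, here is a minimal sketch for groups only. The operation names follow the response example above (`update_group` and `delete_group` are assumed by analogy with `create_group`); `diffGroups`, `sameMembers`, and `summarize` are hypothetical names. A real implementation would likely also need to exempt resources that are not managed through GitOps, such as NetBird's built-in `All` group.

```typescript
type Groups = Record<string, { peers: string[] }>;
type OpType = "create_group" | "update_group" | "delete_group";

interface Operation {
  type: OpType;
  name: string;
}

const hasOwn = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Membership is compared as a set: order and duplicates are ignored.
function sameMembers(a: string[], b: string[]): boolean {
  const left = new Set(a);
  const right = new Set(b);
  return left.size === right.size && Array.from(left).every((x) => right.has(x));
}

// Desired-only names are created, shared names with different members are
// updated, and actual-only names are deleted.
function diffGroups(desired: Groups, actual: Groups): Operation[] {
  const ops: Operation[] = [];
  for (const name of Object.keys(desired)) {
    if (!hasOwn(actual, name)) {
      ops.push({ type: "create_group", name });
    } else if (!sameMembers(desired[name].peers, actual[name].peers)) {
      ops.push({ type: "update_group", name });
    }
  }
  for (const name of Object.keys(actual)) {
    if (!hasOwn(desired, name)) ops.push({ type: "delete_group", name });
  }
  return ops;
}

// Counts operations per kind, in the shape of the `summary` field.
function summarize(ops: Operation[]): { created: number; updated: number; deleted: number } {
  const summary = { created: 0, updated: 0, deleted: 0 };
  for (const op of ops) {
    if (op.type === "create_group") summary.created++;
    else if (op.type === "update_group") summary.updated++;
    else summary.deleted++;
  }
  return summary;
}
```

Because the diff is a pure function of desired and actual state, the same code can serve both the dry-run plan and the apply path, and an empty result on a second run is what makes the reconciler idempotent.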

### `POST /sync-events`

Forces the event poller to process pending events immediately. Returns detected enrollments.

```json
{
  "enrollments": [
    { "setup_key_name": "GS-hawk-72", "peer_id": "abc123", "renamed": true, "groups_assigned": true }
  ]
}
```

### `GET /health`

No auth. Returns service status for the Docker healthcheck.

## Event Poller

**Mechanism:**
- Polls `GET /api/events` every 30 seconds (configurable via `POLL_INTERVAL_SECONDS`)
- Persists `last_event_timestamp` to `/data/poller-state.json` (Docker volume)
- Loads last-known `netbird.json` desired state on startup and after each reconcile

**Enrollment detection:**
1. Filter events for `peer.setupkey.add` activity
2. Extract `setup_key_name` from event metadata
3. Look up in desired state. If found and `enrolled: false`:
   - Rename peer to match setup key name via `PUT /api/peers/{id}`
   - Assign peer to groups from `setup_keys[name].auto_groups`
   - Commit `enrolled: true` to git via Gitea API (optimistic concurrency with SHA check)
   - Commit message: `chore: mark {key_name} as enrolled [automated]`
4. If not found: log warning (unknown peer enrolled outside GitOps)

**Edge cases:**
- Race with reconcile: if a reconcile is in progress, enrollment processing queues until it completes
- Duplicate events: handling is idempotent; skip if the peer is already renamed and enrolled
- Unknown peers: logged but not touched
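
The detection rules and edge cases above reduce to a small decision function. This is a sketch under stated assumptions: the design only says that the activity is `peer.setupkey.add` and that the setup key name comes from event metadata, so the event field names, the `Decision` type, and `decideEnrollment` are all hypothetical.

```typescript
// Assumed event shape; the real NetBird event payload may differ.
interface EnrollmentEvent {
  activity_code: string;
  peer_id: string;
  setup_key_name: string;
}

interface SetupKeyState {
  auto_groups: string[];
  enrolled: boolean;
}

type Decision =
  | { action: "enroll"; peerId: string; newName: string; groups: string[] }
  | { action: "skip"; reason: "not_enrollment" | "unknown_key" | "already_enrolled" };

function decideEnrollment(
  event: EnrollmentEvent,
  setupKeys: Record<string, SetupKeyState>,
): Decision {
  if (event.activity_code !== "peer.setupkey.add") {
    return { action: "skip", reason: "not_enrollment" };
  }
  const known = Object.prototype.hasOwnProperty.call(setupKeys, event.setup_key_name);
  if (!known) {
    // Peer enrolled outside GitOps: the caller logs a warning and leaves it alone.
    return { action: "skip", reason: "unknown_key" };
  }
  const key = setupKeys[event.setup_key_name];
  if (key.enrolled) {
    // Duplicate event: the peer was already processed.
    return { action: "skip", reason: "already_enrolled" };
  }
  return {
    action: "enroll",
    peerId: event.peer_id,
    newName: event.setup_key_name,
    groups: key.auto_groups,
  };
}
```

Keeping the decision separate from the side effects (rename, group assignment, git commit) makes the idempotency rule easy to test without a NetBird or Gitea instance.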

## CI Workflows

### `dry-run.yml`: On PR open/update

```yaml
on:
  pull_request:
    paths: ['netbird.json']
```

Steps:
1. Checkout PR branch
2. `POST /reconcile?dry_run=true` with `netbird.json`
3. Format response as markdown table
4. Post/update PR comment via Gitea API
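
Step 3 could look roughly like the sketch below. It assumes dry-run operations carry the same `type`, `name`, and `status` fields as the response example; `formatPlan` and the column titles are hypothetical.

```typescript
interface PlannedOperation {
  type: string;
  name: string;
  status: string;
}

// Renders the operations list as a markdown table for the PR comment.
function formatPlan(ops: PlannedOperation[]): string {
  if (ops.length === 0) return "No changes.";
  // Escape pipes so a value cannot break out of its table cell.
  const cell = (s: string): string => s.replace(/\|/g, "\\|");
  const rows = ops.map(
    (op) => `| ${cell(op.type)} | ${cell(op.name)} | ${cell(op.status)} |`,
  );
  return ["| Operation | Name | Status |", "|---|---|---|", ...rows].join("\n");
}
```

The resulting string can be sent as the comment body in step 4.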

### `reconcile.yml`: On push to main

```yaml
on:
  push:
    branches: [main]
    paths: ['netbird.json']
```

Steps:
1. Checkout repo
2. `POST /sync-events` to process pending enrollments
3. `POST /reconcile` with `netbird.json`
4. If `created_keys` is non-empty: encrypt with `age`, upload as Gitea Actions artifact
5. Pull latest (poller may have committed)
6. On failure: job fails, engineer investigates

### Gitea Secrets

| Secret | Purpose |
|--------|---------|
| `RECONCILER_URL` | Reconciler service URL |
| `RECONCILER_TOKEN` | Bearer token for CI auth |
| `AGE_PUBLIC_KEY` | Encrypts setup key artifacts |
| `GITEA_TOKEN` | PR comment posting (achilles-ci-bot) |

## Deployment

Docker Compose on the NetBird VPS:

```yaml
services:
  netbird-reconciler:
    image: gitea.internal/blastpilot/netbird-reconciler:latest
    restart: unless-stopped
    environment:
      NETBIRD_API_URL: "https://netbird.example.com/api"
      NETBIRD_API_TOKEN: "${NETBIRD_API_TOKEN}"
      GITEA_URL: "https://gitea.example.com"
      GITEA_TOKEN: "${GITEA_TOKEN}"
      GITEA_REPO: "BlastPilot/netbird-gitops"
      RECONCILER_TOKEN: "${RECONCILER_TOKEN}"
      POLL_INTERVAL_SECONDS: "30"
      PORT: "8080"
    volumes:
      - reconciler-data:/data
    healthcheck:
      # NOTE: this check needs curl inside the image; distroless base images
      # do not ship it, so the image or the check must account for that.
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.reconciler.rule=Host(`reconciler.internal`)"

# Named volumes must be declared at the top level.
volumes:
  reconciler-data:
```

### Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `NETBIRD_API_URL` | yes | NetBird management API base URL |
| `NETBIRD_API_TOKEN` | yes | NetBird API token |
| `GITEA_URL` | yes | Gitea instance URL |
| `GITEA_TOKEN` | yes | Gitea API token for commits |
| `GITEA_REPO` | yes | `owner/repo` for netbird-gitops |
| `RECONCILER_TOKEN` | yes | Bearer token for CI auth |
| `POLL_INTERVAL_SECONDS` | no | Poll interval in seconds (default: 30) |
| `PORT` | no | Listen port (default: 8080) |

### Container Image Build

Tag-triggered CI (`v*`) in netbird-gitops:
1. `deno compile` to single binary
2. Docker build (`FROM denoland/deno:distroless`)
3. Push to Gitea container registry

## Error Handling & Rollback

**Validation phase (before mutations):**
- Parse and validate `netbird.json` schema
- Fetch all actual state
- Compute diff and verify all operations are possible
- If validation fails: return error, no mutations

**Apply phase:**
- Execute in dependency order (groups -> keys -> peers -> policies -> routes)
- On any failure: abort immediately, return partial results
- No automatic rollback; `git revert` is the rollback mechanism
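
The apply phase can be sketched as a fail-fast loop. This is illustrative only: `applyInOrder`, the `Kind` names, and the injected `exec` callback (which stands in for the real NetBird API calls) are hypothetical.

```typescript
type Kind = "group" | "setup_key" | "peer" | "policy" | "route";

// Dependency order from the design: groups -> keys -> peers -> policies -> routes.
const ORDER: Kind[] = ["group", "setup_key", "peer", "policy", "route"];

interface Op {
  kind: Kind;
  name: string;
}

interface OpResult extends Op {
  status: "success" | "failed";
  error?: string;
}

// Runs operations in dependency order and stops at the first failure.
// Operations after the failed one are never attempted, and nothing already
// applied is undone; the caller gets the partial results back.
async function applyInOrder(
  ops: Op[],
  exec: (op: Op) => Promise<void>,
): Promise<OpResult[]> {
  const sorted = [...ops].sort(
    (a, b) => ORDER.indexOf(a.kind) - ORDER.indexOf(b.kind),
  );
  const results: OpResult[] = [];
  for (const op of sorted) {
    try {
      await exec(op);
      results.push({ ...op, status: "success" });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      results.push({ ...op, status: "failed", error });
      break;
    }
  }
  return results;
}
```

Injecting `exec` keeps the ordering and abort logic testable on its own; the partial `OpResult[]` maps directly onto the `operations` array in the `/reconcile` response.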

**Why no automatic rollback:**
- Partial rollback is harder to get right than partial apply
- Git history provides a clear, auditable rollback path
- `git revert` + re-reconcile converges to the correct state
- The reconciler is idempotent: running twice with the same state is safe

**Recovery pattern:**
1. Reconcile fails mid-apply
2. CI job fails, engineer notified
3. Engineer either forward-fixes `netbird.json` or `git revert`s the merge commit
4. New push triggers reconcile, converging to the correct state

**Logging:**
- Structured JSON logs
- Every NetBird API call logged (method, path, status)
- Every state mutation logged (before/after)
- Event poller logs each event processed

## Resources Managed

| Resource | NetBird API | Create | Update | Delete |
|----------|-------------|--------|--------|--------|
| Groups | `/api/groups` | yes | yes (peers) | yes |
| Setup Keys | `/api/setup-keys` | yes | no (immutable) | yes |
| Peers | `/api/peers` | no (self-enroll) | yes (rename, groups) | yes |
| Policies | `/api/policies` | yes | yes | yes |
| Routes | `/api/routes` | yes | yes | yes |
| DNS | `/api/dns/nameservers` | yes | yes | yes |

docs/plans/2026-03-03-netbird-reconciler-implementation.md (new file, 2996 lines; file diff suppressed because it is too large)