Reconciler PoC Validation — Design Document
Status: Draft
Author: @prox
Date: 2026-03-06
Depends on: NetBird Reconciler Design
Goal
Validate the reconciler end-to-end on a fresh, isolated NetBird instance before pointing it at production. Prove that:
- Declaring state in netbird.json → reconcile → resources appear in NetBird.
- Event poller detects peer enrollment and renames the peer.
- State export from a live NetBird instance produces a valid netbird.json.
Scope
In scope
- Deploy a self-contained stack on VPS-A (vps-a.networkmonitor.cc): fresh NetBird, Caddy, Gitea, and reconciler, all via Docker Compose.
- GITEA_ENABLED feature flag so the reconciler works without Gitea integration.
- State export tool: GET /export endpoint + --export CLI flag.
- Core reconcile: groups, setup keys, policies created via /reconcile.
- Event poller: detect enrollment, rename peer, with or without Gitea commit-back.
Out of scope (deferred)
- Enrollment pipeline integration (docs site → Gitea PR).
- CI workflows (dry-run on PR, reconcile on merge).
- Production deployment to real NetBird environments.
- Key encryption with age / artifact upload.
Architecture
VPS-A (vps-a.networkmonitor.cc)
├── Caddy (reverse proxy, HTTPS, ACME)
│ ├── / → NetBird Dashboard
│ ├── /api → NetBird Management API
│ ├── /signalexchange → Signal (gRPC)
│ ├── /relay → Relay
│ └── /reconciler/* → Reconciler HTTP API
├── NetBird Management (config, IdP, API)
├── NetBird Signal (gRPC peer coordination)
├── NetBird Relay (data relay for NATed peers)
├── Coturn (STUN/TURN)
├── Gitea (hosts netbird-gitops repo)
└── Reconciler (reconcile API + event poller)
All containers share a single Docker Compose stack with a common network. Caddy terminates TLS and routes by path prefix.
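The routing table in the diagram can be read as a longest-prefix match. The sketch below only illustrates that reading; it does not reproduce Caddy's matcher semantics, and the upstream labels are descriptive names assumed for the example, not actual container names.

```typescript
// Route table transcribed from the architecture diagram.
// Upstream labels are illustrative assumptions, not real service names.
interface Route {
  prefix: string;
  upstream: string;
}

const ROUTES: Route[] = [
  { prefix: "/reconciler/", upstream: "reconciler" },
  { prefix: "/signalexchange", upstream: "signal" },
  { prefix: "/relay", upstream: "relay" },
  { prefix: "/api", upstream: "management" },
  { prefix: "/", upstream: "dashboard" },
];

// Pick the route with the longest prefix that matches the request path.
function upstreamFor(path: string): string {
  let best: Route | undefined;
  for (const route of ROUTES) {
    const matches = path.startsWith(route.prefix);
    if (matches && (best === undefined || route.prefix.length > best.prefix.length)) {
      best = route;
    }
  }
  if (best === undefined) {
    throw new Error(`No route for path: ${path}`);
  }
  return best.upstream;
}
```

A plain string-prefix check like this would also send a path such as /apiary to the management upstream, so the real Caddyfile matchers may need to be stricter than this sketch.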
Changes to Reconciler
1. Feature Flag: GITEA_ENABLED
New environment variable. Default: true (backward compatible).
When GITEA_ENABLED=false:
| Component | Behavior |
|---|---|
| Config validation | Skip GITEA_* env var requirements |
| Startup | Don't create Gitea client |
| POST /reconcile | Works normally: accepts netbird.json from the request body and applies it to the NetBird API |
| Event poller | Still runs. Detects peer.setupkey.add events, renames peers. Skips commit-back of enrolled: true. Logs enrollment instead. |
| GET /export | Works normally (no Gitea dependency) |
When GITEA_ENABLED=true: Current behavior, unchanged.
Affected files:
- src/config.ts: conditional Gitea env var validation
- src/main.ts: conditional Gitea client creation, pass flag to poller
- src/poller/loop.ts: guard commit-back behind flag
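A minimal sketch of how the conditional validation in src/config.ts might look. Only GITEA_ENABLED is named in this document; the other variable names (NETBIRD_API_URL, NETBIRD_API_TOKEN, GITEA_URL, GITEA_TOKEN) and the config shape are assumptions made for illustration.

```typescript
// Hypothetical variable names: only GITEA_ENABLED comes from this document.
type Env = Record<string, string | undefined>;

interface GiteaConfig {
  url: string;
  token: string;
}

interface ReconcilerConfig {
  netbirdApiUrl: string;
  netbirdApiToken: string;
  giteaEnabled: boolean;
  gitea?: GiteaConfig; // set only when giteaEnabled is true
}

// Read a required variable or fail with a message that names it.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Unset or empty means true, which keeps existing deployments unchanged.
function parseGiteaEnabled(raw: string | undefined): boolean {
  if (raw === undefined || raw.trim() === "") return true;
  const normalized = raw.trim().toLowerCase();
  if (normalized === "true") return true;
  if (normalized === "false") return false;
  throw new Error(`GITEA_ENABLED must be "true" or "false", got "${raw}"`);
}

function loadConfig(env: Env): ReconcilerConfig {
  const giteaEnabled = parseGiteaEnabled(env["GITEA_ENABLED"]);
  const config: ReconcilerConfig = {
    netbirdApiUrl: requireVar(env, "NETBIRD_API_URL"),
    netbirdApiToken: requireVar(env, "NETBIRD_API_TOKEN"),
    giteaEnabled,
  };
  if (giteaEnabled) {
    // GITEA_* variables are validated only when the integration is on.
    config.gitea = {
      url: requireVar(env, "GITEA_URL"),
      token: requireVar(env, "GITEA_TOKEN"),
    };
  }
  return config;
}
```

Taking the environment as a plain record rather than reading it globally keeps the function testable; in main.ts it could be called with the process environment, and the optional gitea field tells the caller whether to create a Gitea client.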
2. State Export
New module: src/export.ts
Transforms ActualState (from src/state/actual.ts) into a valid
netbird.json conforming to DesiredStateSchema.
Mapping logic:
| NetBird resource | Export strategy |
|---|---|
| Groups | Map ID → name. Skip auto-generated groups (All, ch- prefixed). Peer refs mapped to setup key names where possible, otherwise peer hostname. |
| Setup keys | Export with current config. Set enrolled: true if used_times >= usage_limit, else false. |
| Policies | Map source/destination group IDs → names. Include port rules. |
| Routes | Map group IDs → names, include network CIDRs. |
| DNS nameserver groups | Map group refs → names. |
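The mapping rules for groups, setup keys, and policies could be sketched as follows. The input shapes are simplified assumptions (the real ActualState is defined in src/state/actual.ts), expiry and port rules are left out, and all function names are hypothetical.

```typescript
// Simplified, assumed input shapes; not the real ActualState types.
interface ActualGroup {
  id: string;
  name: string;
}

interface ActualSetupKey {
  name: string;
  type: "one-off" | "reusable";
  usage_limit: number;
  used_times: number;
  auto_groups: string[]; // group IDs
}

interface ActualPolicy {
  name: string;
  enabled: boolean;
  sources: string[]; // group IDs
  destinations: string[]; // group IDs
  bidirectional: boolean;
}

// Index every group, including auto-generated ones, so references resolve.
function buildNameIndex(groups: ActualGroup[]): Map<string, string> {
  return new Map(groups.map((g): [string, string] => [g.id, g.name]));
}

function resolveGroupNames(ids: string[], nameById: Map<string, string>): string[] {
  return ids.map((id) => {
    const name = nameById.get(id);
    if (name === undefined) throw new Error(`Unknown group ID: ${id}`);
    return name;
  });
}

// Auto-generated groups per the table: "All" and names prefixed with "ch-".
function isAutoGenerated(name: string): boolean {
  return name === "All" || name.startsWith("ch-");
}

function exportGroupNames(groups: ActualGroup[]): string[] {
  return groups.map((g) => g.name).filter((name) => !isAutoGenerated(name));
}

function exportSetupKey(key: ActualSetupKey, nameById: Map<string, string>) {
  return {
    type: key.type,
    usage_limit: key.usage_limit,
    auto_groups: resolveGroupNames(key.auto_groups, nameById),
    // Rule from the table: enrolled once the key has been used up.
    enrolled: key.used_times >= key.usage_limit,
  };
}

function exportPolicy(policy: ActualPolicy, nameById: Map<string, string>) {
  return {
    enabled: policy.enabled,
    sources: resolveGroupNames(policy.sources, nameById),
    destinations: resolveGroupNames(policy.destinations, nameById),
    bidirectional: policy.bidirectional,
  };
}
```

One thing worth checking against the live API: if an unlimited key is reported with a usage_limit of 0, the rule as written would mark it enrolled even when it has never been used, so the export may need a special case for that value.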
Interfaces:
GET /export
→ 200: { state: <netbird.json content>, meta: { exported_at, source_url, groups_count, ... } }
CLI: deno run src/main.ts --export --netbird-api-url <url> --netbird-api-token <token>
→ stdout: netbird.json content
The CLI mode is standalone — it creates a NetBird client, fetches state, exports, and exits. No HTTP server started.
Affected files:
- src/export.ts: new, transformation logic
- src/server.ts: new endpoint GET /export
- src/main.ts: new CLI flag --export
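The two export interfaces could share one code path, with the CLI flag only selecting the mode. The sketch below shows one possible way to parse the documented flags and build the GET /export envelope. The function names are hypothetical, the ISO timestamp format for exported_at is an assumption, and only the meta fields named in this document are filled in.

```typescript
type CliMode =
  | { mode: "serve" }
  | { mode: "export"; netbirdApiUrl: string; netbirdApiToken: string };

// Parse the flags documented above; anything else is rejected.
function parseCliArgs(argv: string[]): CliMode {
  let exportMode = false;
  const values = new Map<string, string>();
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--export") {
      exportMode = true;
    } else if (arg === "--netbird-api-url" || arg === "--netbird-api-token") {
      const value = argv[i + 1];
      if (value === undefined || value.startsWith("--")) {
        throw new Error(`${arg} requires a value`);
      }
      values.set(arg, value);
      i++; // skip the value that was just consumed
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  if (!exportMode) return { mode: "serve" };
  const url = values.get("--netbird-api-url");
  const token = values.get("--netbird-api-token");
  if (url === undefined || token === undefined) {
    throw new Error("--export requires --netbird-api-url and --netbird-api-token");
  }
  return { mode: "export", netbirdApiUrl: url, netbirdApiToken: token };
}

interface ExportedState {
  groups: Record<string, unknown>;
}

// Envelope for GET /export. The CLI would print only `state` to stdout.
function buildExportResponse(state: ExportedState, sourceUrl: string, now: Date = new Date()) {
  return {
    state,
    meta: {
      exported_at: now.toISOString(), // assumed format
      source_url: sourceUrl,
      groups_count: Object.keys(state.groups).length,
    },
  };
}
```

In main.ts the parser could be fed the process arguments: export mode would fetch state, print it, and exit, while serve mode would start the HTTP server as before.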
3. No Structural Changes
The reconcile engine (diff.ts, executor.ts), NetBird client, and state
schema remain unchanged. The export tool and feature flag are additive.
Ansible Playbook
Location: poc/ansible/ within this repo.
poc/
  ansible/
    inventory.yml
    playbook.yml
    group_vars/
      all/
        vars.yml              # domain, ports, non-secret config
        vault.yml             # secrets (gitignored)
        vault.yml.example     # template for secrets
    templates/
      docker-compose.yml.j2
      management.json.j2      # NetBird management config (embedded IdP)
      Caddyfile.j2
      dashboard.env.j2
      relay.env.j2
      turnserver.conf.j2
      reconciler.env.j2
      gitea.env.j2
Playbook tasks:
- Install Docker + Docker Compose (if not present)
- Create working directory structure
- Template all config files
- Pull images, docker compose up -d
- Wait for Gitea to be ready
- Create Gitea admin user + BlastPilot org + netbird-gitops repo via API
- Seed netbird.json into the repo with initial test state
Key config decisions:
- Caddy for reverse proxy (proven in existing PoC templates).
- Embedded IdP for NetBird (no external OAuth — same as existing PoC).
- Secrets auto-generated at deploy time (NetBird encryption key, TURN secret, relay secret). Printed to stdout for operator reference.
- Reconciler env vars templated from vault.yml (NetBird API token, Gitea token).
SSH key: ~/.ssh/hetzner (same as docs site deployment).
Deploy command: ansible-playbook -i inventory.yml playbook.yml
Test netbird.json
The seed state for validation:
{
  "groups": {
    "ground-stations": { "peers": [] },
    "pilots": { "peers": [] }
  },
  "setup_keys": {
    "GS-TestHawk-1": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["ground-stations"],
      "enrolled": false
    },
    "Pilot-TestHawk-1": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["pilots"],
      "enrolled": false
    }
  },
  "policies": {
    "pilots-to-gs": {
      "enabled": true,
      "sources": ["pilots"],
      "destinations": ["ground-stations"],
      "bidirectional": true
    }
  },
  "routes": {},
  "dns": { "nameserver_groups": {} }
}
This creates two groups, two one-off setup keys, and a bidirectional policy between pilots and ground stations. Minimal but sufficient to validate the full reconcile + enrollment flow.
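Because setup keys and policies refer to groups by name, a quick referential check on the seed file can catch typos before the first reconcile. The following is a hypothetical helper, not part of DesiredStateSchema as far as this document states; the types mirror only the fields used in the seed state above.

```typescript
// Types mirror the seed file; routes and dns are omitted for brevity.
interface DesiredState {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, {
    type: "one-off" | "reusable";
    expires_in: number;
    usage_limit: number;
    auto_groups: string[];
    enrolled: boolean;
  }>;
  policies: Record<string, {
    enabled: boolean;
    sources: string[];
    destinations: string[];
    bidirectional: boolean;
  }>;
}

// Return one message per reference to a group that is not declared.
function findDanglingGroupRefs(state: DesiredState): string[] {
  const declared = new Set(Object.keys(state.groups));
  const problems: string[] = [];
  for (const [keyName, key] of Object.entries(state.setup_keys)) {
    for (const group of key.auto_groups) {
      if (!declared.has(group)) {
        problems.push(`setup_keys.${keyName}.auto_groups: unknown group "${group}"`);
      }
    }
  }
  for (const [policyName, policy] of Object.entries(state.policies)) {
    for (const group of [...policy.sources, ...policy.destinations]) {
      if (!declared.has(group)) {
        problems.push(`policies.${policyName}: unknown group "${group}"`);
      }
    }
  }
  return problems;
}

const seed: DesiredState = {
  groups: { "ground-stations": { peers: [] }, pilots: { peers: [] } },
  setup_keys: {
    "GS-TestHawk-1": { type: "one-off", expires_in: 604800, usage_limit: 1, auto_groups: ["ground-stations"], enrolled: false },
    "Pilot-TestHawk-1": { type: "one-off", expires_in: 604800, usage_limit: 1, auto_groups: ["pilots"], enrolled: false },
  },
  policies: {
    "pilots-to-gs": { enabled: true, sources: ["pilots"], destinations: ["ground-stations"], bidirectional: true },
  },
};

// The seed state has no dangling references, so this logs an empty list.
console.log(findDanglingGroupRefs(seed));
```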
Validation Plan
Phase 1 — Deploy
- Wipe VPS-A (or just docker compose down -v if redeploying).
- Run playbook → full stack up.
- Access NetBird dashboard at https://vps-a.networkmonitor.cc and verify clean state (only default "All" group).
- Access Gitea at https://vps-a.networkmonitor.cc/gitea (or dedicated port) and verify the BlastPilot/netbird-gitops repo exists with seed netbird.json.
Phase 2 — Reconcile
- curl -X POST 'https://vps-a.networkmonitor.cc/reconciler/reconcile?dry_run=true' -d @netbird.json → Verify plan shows: create 2 groups, 2 setup keys, 1 policy. (The URL is quoted so the shell does not interpret the ? character.)
- curl -X POST https://vps-a.networkmonitor.cc/reconciler/reconcile -d @netbird.json → Verify response includes created_keys with actual key values.
- Open NetBird dashboard → verify groups, setup keys, and policy exist.
- curl https://vps-a.networkmonitor.cc/reconciler/export → Compare exported state with input. Verify round-trip consistency.
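The expected dry-run result can be reasoned about with a create-only diff by resource name. This is a deliberately reduced sketch: the real engine in diff.ts is not described here and presumably handles more than creation, and the plan shape below is an assumption rather than the actual response format.

```typescript
type ResourceKind = "group" | "setup_key" | "policy";

interface CreateOp {
  action: "create";
  kind: ResourceKind;
  name: string;
}

// Resource names only; enough to decide what is missing.
interface ResourceNames {
  groups: string[];
  setup_keys: string[];
  policies: string[];
}

function missing(desired: string[], actual: string[]): string[] {
  const present = new Set(actual);
  return desired.filter((name) => !present.has(name));
}

// Groups come first because setup keys and policies reference them.
function planCreates(desired: ResourceNames, actual: ResourceNames): CreateOp[] {
  const op = (kind: ResourceKind) => (name: string): CreateOp => ({ action: "create", kind, name });
  return [
    ...missing(desired.groups, actual.groups).map(op("group")),
    ...missing(desired.setup_keys, actual.setup_keys).map(op("setup_key")),
    ...missing(desired.policies, actual.policies).map(op("policy")),
  ];
}

function summarize(plan: CreateOp[]): Record<ResourceKind, number> {
  const counts: Record<ResourceKind, number> = { group: 0, setup_key: 0, policy: 0 };
  for (const item of plan) counts[item.kind] += 1;
  return counts;
}

const desiredNames: ResourceNames = {
  groups: ["ground-stations", "pilots"],
  setup_keys: ["GS-TestHawk-1", "Pilot-TestHawk-1"],
  policies: ["pilots-to-gs"],
};

// A fresh instance contains only the default "All" group.
const freshInstance: ResourceNames = { groups: ["All"], setup_keys: [], policies: [] };

const firstPlan = planCreates(desiredNames, freshInstance);
const roundTripPlan = planCreates(desiredNames, desiredNames);
```

Against a fresh instance the seed state yields five create operations (two groups, two setup keys, one policy), and once the actual state matches the desired names the plan is empty, which is the round-trip property the export check relies on.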
Phase 3 — Enrollment
- Copy a setup key value from the created_keys field of the Phase 2 reconcile response (the non-dry-run call).
- On a test machine: netbird up --setup-key <key>.
- Check NetBird dashboard: peer appears, gets auto-renamed by poller, placed in correct group.
- Check reconciler logs: enrollment event detected, peer renamed, log entry written (no Gitea commit since GITEA_ENABLED=false for initial test).
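The poller behavior checked in this phase can be sketched as a pure decision function that turns new events into actions, leaving the actual API calls to the loop. The event shape, the numeric cursor, and renaming the peer to the setup key name are all assumptions made for illustration; only the event code and the GITEA_ENABLED guard come from this document.

```typescript
// Event code as named in this document.
const ENROLLMENT_EVENT = "peer.setupkey.add";

// Hypothetical, simplified event shape; the real API payload may differ.
interface PollerEvent {
  id: number;
  activity_code: string;
  peer_id: string;
  setup_key_name: string;
}

type PollerAction =
  | { kind: "rename_peer"; peerId: string; newName: string }
  | { kind: "commit_enrolled"; setupKeyName: string }
  | { kind: "log_enrollment"; setupKeyName: string; peerId: string };

interface PollerPlan {
  actions: PollerAction[];
  cursor: number; // highest event ID seen so far
}

function planPollerActions(
  events: PollerEvent[],
  lastSeenId: number,
  giteaEnabled: boolean,
): PollerPlan {
  const actions: PollerAction[] = [];
  let cursor = lastSeenId;
  // Process in ascending ID order so the cursor only moves forward.
  const sorted = [...events].sort((a, b) => a.id - b.id);
  for (const event of sorted) {
    if (event.id <= lastSeenId) continue; // handled in an earlier poll
    cursor = event.id;
    if (event.activity_code !== ENROLLMENT_EVENT) continue;
    // Assumption: the peer is renamed to the setup key it enrolled with.
    actions.push({ kind: "rename_peer", peerId: event.peer_id, newName: event.setup_key_name });
    if (giteaEnabled) {
      actions.push({ kind: "commit_enrolled", setupKeyName: event.setup_key_name });
    } else {
      // Commit-back is skipped; the enrollment is only logged.
      actions.push({ kind: "log_enrollment", setupKeyName: event.setup_key_name, peerId: event.peer_id });
    }
  }
  return { actions, cursor };
}
```

Keeping the decision separate from the side effects means the same function covers both flag settings, and the loop in src/poller/loop.ts would only need to execute the returned actions and persist the cursor.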
Phase 4 — State Export (against real instance)
- Run CLI export against dev.netbird.achilles-rnd.cc:
  deno run src/main.ts --export \
    --netbird-api-url https://dev.netbird.achilles-rnd.cc/api \
    --netbird-api-token <token>
- Review output: validates we can bootstrap GitOps from existing environment.
- Optionally: dry-run reconcile the exported state against the same instance. This should produce an empty plan (no changes needed).
Success Criteria
- Reconcile creates all declared resources in NetBird.
- Dry-run returns accurate plan without side effects.
- Export produces valid netbird.json from a live instance.
- Export → dry-run round-trip yields empty plan (idempotent).
- Poller detects enrollment and renames peer within 30s.
- Reconciler starts and operates correctly with GITEA_ENABLED=false.
- Reconciler starts and operates correctly with GITEA_ENABLED=true + Gitea.
Risks
| Risk | Mitigation |
|---|---|
| NetBird Management API behavior differs from docs | Testing against real instance; reconciler has comprehensive error handling |
| Export misses edge cases in resource mapping | Validate with dry-run round-trip (export → reconcile → empty plan) |
| Poller misses events during 30s poll interval | Acceptable for PoC; production can tune interval or add webhook trigger |
| VPS-A resources (2 vCPU, 4GB RAM) insufficient for full stack | Monitor; NetBird + Gitea are lightweight individually |