# Reconciler PoC Validation — Design Document
> **Status:** Draft | **Author:** @prox | **Date:** 2026-03-06 | **Depends on:**
> [NetBird Reconciler Design](2026-03-03-netbird-reconciler-design.md)
## Goal
Validate the reconciler end-to-end on a fresh, isolated NetBird instance before
pointing it at production. Prove that:
1. Declaring state in `netbird.json` → reconcile → resources appear in NetBird.
2. Event poller detects peer enrollment and renames the peer.
3. State export from a live NetBird instance produces a valid `netbird.json`.
## Scope
### In scope
- Deploy a self-contained stack on VPS-A (`vps-a.networkmonitor.cc`): fresh
NetBird, Caddy, Gitea, and reconciler — all via Docker Compose.
- `GITEA_ENABLED` feature flag so the reconciler works without Gitea
integration.
- State export tool: `GET /export` endpoint + `--export` CLI flag.
- Core reconcile: groups, setup keys, policies created via `/reconcile`.
- Event poller: detect enrollment, rename peer — with or without Gitea
commit-back.
### Out of scope (deferred)
- Enrollment pipeline integration (docs site → Gitea PR).
- CI workflows (dry-run on PR, reconcile on merge).
- Production deployment to real NetBird environments.
- Key encryption with `age` / artifact upload.
## Architecture
```
VPS-A (vps-a.networkmonitor.cc)
├── Caddy (reverse proxy, HTTPS, ACME)
│   ├── /                → NetBird Dashboard
│   ├── /api             → NetBird Management API
│   ├── /signalexchange  → Signal (gRPC)
│   ├── /relay           → Relay
│   └── /reconciler/*    → Reconciler HTTP API
├── NetBird Management (config, IdP, API)
├── NetBird Signal (gRPC peer coordination)
├── NetBird Relay (data relay for NATed peers)
├── Coturn (STUN/TURN)
├── Gitea (hosts netbird-gitops repo)
└── Reconciler (reconcile API + event poller)
```
All containers share a single Docker Compose stack with a common network. Caddy
terminates TLS and routes by path prefix.
## Changes to Reconciler
### 1. Feature Flag: `GITEA_ENABLED`
New environment variable. Default: `true` (backward compatible).
**When `GITEA_ENABLED=false`:**
| Component | Behavior |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| Config validation | Skip `GITEA_*` env var requirements |
| Startup | Don't create Gitea client |
| `POST /reconcile` | Works normally — accepts `netbird.json` from request body, applies to NetBird API |
| Event poller | Still runs. Detects `peer.setupkey.add` events, renames peers. Skips commit-back of `enrolled: true`. Logs enrollment instead. |
| `GET /export` | Works normally — no Gitea dependency |
**When `GITEA_ENABLED=true`:** Current behavior, unchanged.
**Affected files:**
- `src/config.ts` — conditional Gitea env var validation
- `src/main.ts` — conditional Gitea client creation, pass flag to poller
- `src/poller/loop.ts` — guard commit-back behind flag
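
The conditional validation could look roughly like the sketch below. It is only an illustration: the `Config` shape, the helper `requireVar`, and the variable names `NETBIRD_API_URL`, `NETBIRD_API_TOKEN`, `GITEA_URL`, `GITEA_TOKEN`, and `GITEA_REPO` are assumptions, since this document only names `GITEA_ENABLED` and the `GITEA_*` prefix. The environment is passed in as a plain object so the function can be exercised without touching the process environment.

```typescript
// Hypothetical config shape; the real src/config.ts may differ.
interface GiteaConfig {
  url: string;
  token: string;
  repo: string;
}

interface Config {
  netbirdApiUrl: string;
  netbirdApiToken: string;
  giteaEnabled: boolean;
  gitea?: GiteaConfig; // present only when giteaEnabled is true
}

type Env = Record<string, string | undefined>;

// Returns the variable's value or throws if it is unset or empty.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`missing required environment variable: ${name}`);
  }
  return value;
}

function loadConfig(env: Env): Config {
  // An unset flag means "true", which keeps existing deployments unchanged.
  const raw = (env["GITEA_ENABLED"] ?? "true").toLowerCase();
  if (raw !== "true" && raw !== "false") {
    throw new Error(`GITEA_ENABLED must be "true" or "false", got "${raw}"`);
  }
  const giteaEnabled = raw === "true";

  const config: Config = {
    netbirdApiUrl: requireVar(env, "NETBIRD_API_URL"), // assumed name
    netbirdApiToken: requireVar(env, "NETBIRD_API_TOKEN"), // assumed name
    giteaEnabled,
  };

  // GITEA_* variables are validated only when the integration is on.
  if (giteaEnabled) {
    config.gitea = {
      url: requireVar(env, "GITEA_URL"), // assumed name
      token: requireVar(env, "GITEA_TOKEN"), // assumed name
      repo: requireVar(env, "GITEA_REPO"), // assumed name
    };
  }
  return config;
}
```

Under Deno, a call such as `loadConfig(Deno.env.toObject())` would feed it the real environment. Keeping `gitea` optional lets `src/main.ts` decide on client creation with a single `if (config.gitea)` check.
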
### 2. State Export
New module: `src/export.ts`
Transforms `ActualState` (from `src/state/actual.ts`) into a valid
`netbird.json` conforming to `DesiredStateSchema`.
**Mapping logic:**
| NetBird resource | Export strategy |
| --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| Groups | Map ID → name. Skip auto-generated groups (`All`, `ch-` prefixed). Peer refs mapped to setup key names where possible, otherwise peer hostname. |
| Setup keys | Export with current config. Set `enrolled: true` if `used_times >= usage_limit`, else `false`. |
| Policies | Map source/destination group IDs → names. Include port rules. |
| Routes | Map group IDs → names, include network CIDRs. |
| DNS nameserver groups | Map group refs → names. |
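
As a rough illustration of the first two rows of the table, the sketch below maps group IDs to names, skips auto-generated groups, and derives `enrolled` for setup keys. The input types are hypothetical stand-ins for `ActualState`, and peer references, policies, routes, and DNS are left out. One deliberate deviation is labeled in the code: the extra `usage_limit > 0` guard assumes that a limit of 0 denotes unlimited usage, in which case the plain `used_times >= usage_limit` comparison would mark an unused key as enrolled.

```typescript
// Hypothetical, simplified stand-ins for the ActualState types.
interface ActualGroup {
  id: string;
  name: string;
}

interface ActualSetupKey {
  name: string;
  type: "one-off" | "reusable";
  usage_limit: number;
  used_times: number;
  auto_groups: string[]; // group IDs
}

interface ExportedSetupKey {
  type: "one-off" | "reusable";
  usage_limit: number;
  auto_groups: string[]; // group names
  enrolled: boolean;
}

interface ExportedState {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, ExportedSetupKey>;
}

// Groups named "All" or prefixed with "ch-" are skipped, per the table above.
function isAutoGenerated(groupName: string): boolean {
  return groupName === "All" || groupName.startsWith("ch-");
}

function exportGroupsAndKeys(
  groups: ActualGroup[],
  keys: ActualSetupKey[],
): ExportedState {
  const nameById = new Map<string, string>();
  for (const group of groups) {
    if (!isAutoGenerated(group.name)) nameById.set(group.id, group.name);
  }

  const state: ExportedState = { groups: {}, setup_keys: {} };
  for (const name of nameById.values()) {
    state.groups[name] = { peers: [] }; // peer mapping omitted in this sketch
  }

  for (const key of keys) {
    // References to skipped or unknown groups are dropped.
    const autoGroups = key.auto_groups
      .map((id) => nameById.get(id))
      .filter((name): name is string => name !== undefined);

    state.setup_keys[key.name] = {
      type: key.type,
      usage_limit: key.usage_limit,
      auto_groups: autoGroups,
      // Assumption: usage_limit 0 means "unlimited" and never counts as used up.
      enrolled: key.usage_limit > 0 && key.used_times >= key.usage_limit,
    };
  }
  return state;
}
```
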
**Interfaces:**
```
GET /export
→ 200: { state: <netbird.json content>, meta: { exported_at, source_url, groups_count, ... } }
CLI: deno run src/main.ts --export --netbird-api-url <url> --netbird-api-token <token>
→ stdout: netbird.json content
```
The CLI mode is standalone — it creates a NetBird client, fetches state,
exports, and exits. No HTTP server started.
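
A minimal sketch of how `src/main.ts` might choose between server mode and standalone export mode is shown below. The `CliMode` type and `parseCliArgs` function are hypothetical, and the parser is hand-rolled only to keep the example dependency-free; the real entry point may well use a library parser instead. Only the three flags named above are taken from this document.

```typescript
// Hypothetical result type: either start the HTTP server or run a one-shot export.
type CliMode =
  | { mode: "serve" }
  | { mode: "export"; netbirdApiUrl: string; netbirdApiToken: string };

// Flags that take no value. Every other flag must be followed by one.
const BOOLEAN_FLAGS = new Set(["export"]);

function parseCliArgs(argv: string[]): CliMode {
  const values = new Map<string, string>();
  const switches = new Set<string>();

  let i = 0;
  while (i < argv.length) {
    const arg = argv[i] ?? "";
    if (!arg.startsWith("--")) {
      throw new Error(`unexpected argument: "${arg}"`);
    }
    const name = arg.slice(2);

    if (BOOLEAN_FLAGS.has(name)) {
      switches.add(name);
      i += 1;
      continue;
    }

    const value = argv[i + 1];
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`flag --${name} requires a value`);
    }
    values.set(name, value);
    i += 2;
  }

  // Without --export the process runs as the usual HTTP server.
  if (!switches.has("export")) return { mode: "serve" };

  const url = values.get("netbird-api-url");
  const token = values.get("netbird-api-token");
  if (url === undefined || token === undefined) {
    throw new Error("--export requires --netbird-api-url and --netbird-api-token");
  }
  return { mode: "export", netbirdApiUrl: url, netbirdApiToken: token };
}
```

Under Deno the input would be `Deno.args`. In export mode the caller would fetch state, print the resulting `netbird.json` to stdout, and exit without binding a port.
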
**Affected files:**
- `src/export.ts` — new: transformation logic
- `src/server.ts` — new endpoint: `GET /export`
- `src/main.ts` — new CLI flag: `--export`
### 3. No Structural Changes
The reconcile engine (`diff.ts`, `executor.ts`), NetBird client, and state
schema remain unchanged. The export tool and feature flag are additive.
## Ansible Playbook
Location: `poc/ansible/` within this repo.
```
poc/
  ansible/
    inventory.yml
    playbook.yml
    group_vars/
      all/
        vars.yml            # domain, ports, non-secret config
        vault.yml           # secrets (gitignored)
        vault.yml.example   # template for secrets
    templates/
      docker-compose.yml.j2
      management.json.j2    # NetBird management config (embedded IdP)
      Caddyfile.j2
      dashboard.env.j2
      relay.env.j2
      turnserver.conf.j2
      reconciler.env.j2
      gitea.env.j2
```
**Playbook tasks:**
1. Install Docker + Docker Compose (if not present)
2. Create working directory structure
3. Template all config files
4. Pull images, `docker compose up -d`
5. Wait for Gitea to be ready
6. Create Gitea admin user + `BlastPilot` org + `netbird-gitops` repo via API
7. Seed `netbird.json` into the repo with initial test state
**Key config decisions:**
- **Caddy** for reverse proxy (proven in existing PoC templates).
- **Embedded IdP** for NetBird (no external OAuth — same as existing PoC).
- **Secrets auto-generated** at deploy time (NetBird encryption key, TURN
secret, relay secret). Printed to stdout for operator reference.
- Reconciler env vars templated from `vault.yml` (NetBird API token, Gitea
token).
**SSH key:** `~/.ssh/hetzner` (same as docs site deployment).
**Deploy command:** `ansible-playbook -i inventory.yml playbook.yml`
## Test netbird.json
The seed state for validation:
```json
{
  "groups": {
    "ground-stations": { "peers": [] },
    "pilots": { "peers": [] }
  },
  "setup_keys": {
    "GS-TestHawk-1": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["ground-stations"],
      "enrolled": false
    },
    "Pilot-TestHawk-1": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["pilots"],
      "enrolled": false
    }
  },
  "policies": {
    "pilots-to-gs": {
      "enabled": true,
      "sources": ["pilots"],
      "destinations": ["ground-stations"],
      "bidirectional": true
    }
  },
  "routes": {},
  "dns": { "nameserver_groups": {} }
}
```
This creates two groups, two one-off setup keys, and a bidirectional policy
between pilots and ground stations. It is minimal but sufficient to validate the
full reconcile and enrollment flow.
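
Before seeding, it can help to confirm that every group referenced by a setup key or policy is actually declared. The sketch below shows one way to check that. The `DesiredStateRefs` type and `findUndeclaredGroupRefs` function are hypothetical and cover only the fields needed for the check; whether `DesiredStateSchema` already enforces this is not stated here.

```typescript
// Hypothetical subset of the desired state: only fields that hold group references.
interface DesiredStateRefs {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, { auto_groups: string[] }>;
  policies: Record<string, { sources: string[]; destinations: string[] }>;
}

// Returns one message per reference to a group missing from `groups`.
function findUndeclaredGroupRefs(state: DesiredStateRefs): string[] {
  const declared = new Set(Object.keys(state.groups));
  const problems: string[] = [];

  const check = (owner: string, field: string, refs: string[]): void => {
    for (const ref of refs) {
      if (!declared.has(ref)) {
        problems.push(`${owner}.${field} references undeclared group "${ref}"`);
      }
    }
  };

  for (const [name, key] of Object.entries(state.setup_keys)) {
    check(`setup_keys.${name}`, "auto_groups", key.auto_groups);
  }
  for (const [name, policy] of Object.entries(state.policies)) {
    check(`policies.${name}`, "sources", policy.sources);
    check(`policies.${name}`, "destinations", policy.destinations);
  }
  return problems;
}
```

For the seed state above the function returns an empty array, because `ground-stations` and `pilots` are both declared.
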
## Validation Plan
### Phase 1 — Deploy
1. Wipe VPS-A (or just run `docker compose down -v` if redeploying).
2. Run playbook → full stack up.
3. Access NetBird dashboard at `https://vps-a.networkmonitor.cc` — verify clean
state (only default "All" group).
4. Access Gitea at `https://vps-a.networkmonitor.cc/gitea` (or dedicated port) —
verify `BlastPilot/netbird-gitops` repo exists with seed `netbird.json`.
### Phase 2 — Reconcile
5. `curl -X POST 'https://vps-a.networkmonitor.cc/reconciler/reconcile?dry_run=true' -d @netbird.json`
→ Verify plan shows: create 2 groups, 2 setup keys, 1 policy.
6. `curl -X POST https://vps-a.networkmonitor.cc/reconciler/reconcile -d @netbird.json`
→ Verify response includes `created_keys` with actual key values.
7. Open NetBird dashboard → verify groups, setup keys, and policy exist.
8. `curl https://vps-a.networkmonitor.cc/reconciler/export` → Compare exported
state with input. Verify round-trip consistency.
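
The expected dry-run result in step 5, and the empty plan expected once the state has been applied, can be reasoned about with a name-based comparison like the one below. This is a deliberately reduced sketch: `planCreates` and its types are hypothetical, it only produces create operations, and the real engine in `diff.ts` presumably also handles updates and deletions.

```typescript
type ResourceKind = "groups" | "setup_keys" | "policies";
type NamesByKind = Record<ResourceKind, string[]>;

interface CreateOp {
  action: "create";
  kind: ResourceKind;
  name: string;
}

// Lists every desired resource whose name is absent from the actual state.
// Groups come first because setup keys and policies refer to them.
function planCreates(desired: NamesByKind, actual: NamesByKind): CreateOp[] {
  const kinds: ResourceKind[] = ["groups", "setup_keys", "policies"];
  const ops: CreateOp[] = [];
  for (const kind of kinds) {
    const existing = new Set(actual[kind]);
    for (const name of desired[kind]) {
      if (!existing.has(name)) ops.push({ action: "create", kind, name });
    }
  }
  return ops;
}

// Seed state from this document against a fresh instance with only "All".
const desiredSeed: NamesByKind = {
  groups: ["ground-stations", "pilots"],
  setup_keys: ["GS-TestHawk-1", "Pilot-TestHawk-1"],
  policies: ["pilots-to-gs"],
};
const freshInstance: NamesByKind = { groups: ["All"], setup_keys: [], policies: [] };

const firstPlan = planCreates(desiredSeed, freshInstance);
console.log(firstPlan.map((op) => `${op.action} ${op.kind}/${op.name}`));
```

Running the same comparison against an instance that already holds all five resources yields an empty list, which is the idempotency property the round-trip check relies on.
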
### Phase 3 — Enrollment
9. Copy a setup key value from step 6 response.
10. On a test machine: `netbird up --setup-key <key>`.
11. Check NetBird dashboard: peer appears, gets auto-renamed by poller, placed
in correct group.
12. Check reconciler logs: enrollment event detected, peer renamed, log entry
written (no Gitea commit since `GITEA_ENABLED=false` for initial test).
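
The poller behavior checked in steps 11 and 12 can be sketched as a pure decision function that turns a batch of events into actions, leaving the API calls to the caller. Everything except the `peer.setupkey.add` activity code is an assumption made for illustration: the `ActivityEvent` fields, numeric and increasing event IDs, and renaming the peer to the setup key name.

```typescript
// Hypothetical event shape; the real NetBird event payload may differ.
interface ActivityEvent {
  id: number; // assumed to increase monotonically
  activity_code: string;
  peer_id: string;
  setup_key_name: string;
}

type EnrollmentAction =
  | { kind: "rename_peer"; peerId: string; newName: string }
  | { kind: "commit_enrolled"; setupKey: string }
  | { kind: "log_enrollment"; setupKey: string; peerId: string };

// Activity code named in this document for enrollment via setup key.
const ENROLLMENT_CODE = "peer.setupkey.add";

function planEnrollmentActions(
  events: ActivityEvent[],
  lastSeenId: number,
  giteaEnabled: boolean,
): { actions: EnrollmentAction[]; lastSeenId: number } {
  const actions: EnrollmentAction[] = [];
  let newest = lastSeenId;

  // Process only unseen events, oldest first, so the cursor ends at the maximum ID.
  const fresh = events
    .filter((event) => event.id > lastSeenId)
    .sort((a, b) => a.id - b.id);

  for (const event of fresh) {
    newest = event.id;
    if (event.activity_code !== ENROLLMENT_CODE) continue;

    // Assumption: the peer takes the name of the setup key it enrolled with.
    actions.push({
      kind: "rename_peer",
      peerId: event.peer_id,
      newName: event.setup_key_name,
    });

    if (giteaEnabled) {
      // Commit-back of `enrolled: true` happens only with Gitea integration.
      actions.push({ kind: "commit_enrolled", setupKey: event.setup_key_name });
    } else {
      actions.push({
        kind: "log_enrollment",
        setupKey: event.setup_key_name,
        peerId: event.peer_id,
      });
    }
  }
  return { actions, lastSeenId: newest };
}
```

Separating the decision from the side effects keeps the `GITEA_ENABLED` guard in one place and makes it checkable without a running NetBird or Gitea instance.
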
### Phase 4 — State Export (against real instance)
13. Run CLI export against `dev.netbird.achilles-rnd.cc`:
```
deno run src/main.ts --export \
--netbird-api-url https://dev.netbird.achilles-rnd.cc/api \
--netbird-api-token <token>
```
14. Review the output to confirm that GitOps can be bootstrapped from an existing environment.
15. Optionally: dry-run reconcile the exported state against the same instance —
should produce an empty plan (no changes needed).
## Success Criteria
- [ ] Reconcile creates all declared resources in NetBird.
- [ ] Dry-run returns accurate plan without side effects.
- [ ] Export produces valid `netbird.json` from a live instance.
- [ ] Export → dry-run round-trip yields empty plan (idempotent).
- [ ] Poller detects enrollment and renames peer within 30s.
- [ ] Reconciler starts and operates correctly with `GITEA_ENABLED=false`.
- [ ] Reconciler starts and operates correctly with `GITEA_ENABLED=true` +
Gitea.
## Risks
| Risk | Mitigation |
| ------------------------------------------------------------- | -------------------------------------------------------------------------- |
| NetBird Management API behavior differs from docs | Testing against real instance; reconciler has comprehensive error handling |
| Export misses edge cases in resource mapping | Validate with dry-run round-trip (export → reconcile → empty plan) |
| Poller misses events during 30s poll interval | Acceptable for PoC; production can tune interval or add webhook trigger |
| VPS-A resources (2 vCPU, 4GB RAM) insufficient for full stack | Monitor; NetBird + Gitea are lightweight individually |