# Reconciler PoC Validation — Design Document

> **Status:** Draft **Author:** @prox **Date:** 2026-03-06 **Depends on:**
> [NetBird Reconciler Design](2026-03-03-netbird-reconciler-design.md)

## Goal

Validate the reconciler end-to-end on a fresh, isolated NetBird instance before
pointing it at production. Prove that:

1. Declaring state in `netbird.json` → reconcile → resources appear in NetBird.
2. Event poller detects peer enrollment and renames the peer.
3. State export from a live NetBird instance produces a valid `netbird.json`.

## Scope

### In scope

- Deploy a self-contained stack on VPS-A (`vps-a.networkmonitor.cc`): fresh
  NetBird, Caddy, Gitea, and reconciler, all via Docker Compose.
- `GITEA_ENABLED` feature flag so the reconciler works without Gitea
  integration.
- State export tool: `GET /export` endpoint + `--export` CLI flag.
- Core reconcile: groups, setup keys, policies created via `/reconcile`.
- Event poller: detect enrollment, rename peer, with or without Gitea
  commit-back.

### Out of scope (deferred)

- Enrollment pipeline integration (docs site → Gitea PR).
- CI workflows (dry-run on PR, reconcile on merge).
- Production deployment to real NetBird environments.
- Key encryption with `age` / artifact upload.

## Architecture

```
VPS-A (vps-a.networkmonitor.cc)
├── Caddy (reverse proxy, HTTPS, ACME)
│   ├── /                → NetBird Dashboard
│   ├── /api             → NetBird Management API
│   ├── /signalexchange  → Signal (gRPC)
│   ├── /relay           → Relay
│   └── /reconciler/*    → Reconciler HTTP API
├── NetBird Management (config, IdP, API)
├── NetBird Signal (gRPC peer coordination)
├── NetBird Relay (data relay for NATed peers)
├── Coturn (STUN/TURN)
├── Gitea (hosts netbird-gitops repo)
└── Reconciler (reconcile API + event poller)
```

All containers share a single Docker Compose stack with a common network. Caddy
terminates TLS and routes by path prefix.

## Changes to Reconciler

### 1. Feature Flag: `GITEA_ENABLED`

New environment variable. Default: `true` (backward compatible).

**When `GITEA_ENABLED=false`:**

| Component         | Behavior |
| ----------------- | -------- |
| Config validation | Skip `GITEA_*` env var requirements |
| Startup           | Don't create Gitea client |
| `POST /reconcile` | Works normally: accepts `netbird.json` from request body, applies to NetBird API |
| Event poller      | Still runs. Detects `peer.setupkey.add` events, renames peers. Skips commit-back of `enrolled: true`. Logs enrollment instead. |
| `GET /export`     | Works normally: no Gitea dependency |

**When `GITEA_ENABLED=true`:** Current behavior, unchanged.

**Affected files:**

- `src/config.ts`: conditional Gitea env var validation
- `src/main.ts`: conditional Gitea client creation, pass flag to poller
- `src/poller/loop.ts`: guard commit-back behind flag

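The conditional validation described in the table could be sketched as follows. This is a minimal illustration, not the actual `src/config.ts`: the `Config` shape and every variable name other than `GITEA_ENABLED` (`NETBIRD_API_URL`, `NETBIRD_API_TOKEN`, `GITEA_URL`, `GITEA_TOKEN`, `GITEA_REPO`) are assumptions made for the example.

```typescript
// Sketch only: the Config shape and all variable names except GITEA_ENABLED are assumptions.
interface GiteaConfig {
  url: string;
  token: string;
  repo: string;
}

interface Config {
  netbirdApiUrl: string;
  netbirdApiToken: string;
  giteaEnabled: boolean;
  gitea?: GiteaConfig; // present only when giteaEnabled is true
}

type Env = Record<string, string | undefined>;

// Returns the variable's value, or throws if it is unset or empty.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function loadConfig(env: Env): Config {
  // Anything other than an explicit "false" keeps the default (enabled).
  const giteaEnabled = (env["GITEA_ENABLED"] ?? "true").toLowerCase() !== "false";

  const config: Config = {
    netbirdApiUrl: requireVar(env, "NETBIRD_API_URL"),
    netbirdApiToken: requireVar(env, "NETBIRD_API_TOKEN"),
    giteaEnabled,
  };

  // GITEA_* variables are validated only when the integration is on.
  if (giteaEnabled) {
    config.gitea = {
      url: requireVar(env, "GITEA_URL"),
      token: requireVar(env, "GITEA_TOKEN"),
      repo: requireVar(env, "GITEA_REPO"),
    };
  }
  return config;
}
```

In a Deno entry point this would typically be called as `loadConfig(Deno.env.toObject())`. Taking the environment as a parameter keeps the function testable without touching the real process environment.
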
### 2. State Export

New module: `src/export.ts`

Transforms `ActualState` (from `src/state/actual.ts`) into a valid
`netbird.json` conforming to `DesiredStateSchema`.

**Mapping logic:**

| NetBird resource      | Export strategy |
| --------------------- | --------------- |
| Groups                | Map ID → name. Skip auto-generated groups (`All`, `ch-` prefixed). Peer refs mapped to setup key names where possible, otherwise peer hostname. |
| Setup keys            | Export with current config. Set `enrolled: true` if `used_times >= usage_limit`, else `false`. |
| Policies              | Map source/destination group IDs → names. Include port rules. |
| Routes                | Map group IDs → names, include network CIDRs. |
| DNS nameserver groups | Map group refs → names. |

**Interfaces:**

```
GET /export
→ 200: { state: <netbird.json content>, meta: { exported_at, source_url, groups_count, ... } }

CLI: deno run src/main.ts --export --netbird-api-url <url> --netbird-api-token <token>
→ stdout: netbird.json content
```

The CLI mode is standalone: it creates a NetBird client, fetches state,
exports, and exits. No HTTP server is started.

**Affected files:**

- `src/export.ts`: new, transformation logic
- `src/server.ts`: new endpoint, `GET /export`
- `src/main.ts`: new CLI flag, `--export`

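As a rough illustration of the mapping table, the group and setup key rows might translate into something like the sketch below. The type names and field layout are hypothetical simplifications, not the real `ActualState` or `DesiredStateSchema`, and peer mapping, policies, routes, and DNS are left out. One deliberate deviation is labeled in the code: NetBird treats a `usage_limit` of 0 as unlimited, so the sketch adds a guard that the table does not mention; without it, every unlimited key would export as `enrolled: true`.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface ActualGroup {
  id: string;
  name: string;
}

interface ActualSetupKey {
  name: string;
  type: string;
  usage_limit: number;
  used_times: number;
  auto_groups: string[]; // group IDs as returned by the API
}

interface ActualStateSketch {
  groups: ActualGroup[];
  setupKeys: ActualSetupKey[];
}

interface ExportedSetupKey {
  type: string;
  usage_limit: number;
  auto_groups: string[]; // group names
  enrolled: boolean;
}

interface ExportedState {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, ExportedSetupKey>;
}

// Groups the export should skip: the default "All" group and "ch-" prefixed ones.
function isAutoGenerated(groupName: string): boolean {
  return groupName === "All" || groupName.startsWith("ch-");
}

function exportState(actual: ActualStateSketch): ExportedState {
  const nameById = new Map<string, string>();
  for (const group of actual.groups) nameById.set(group.id, group.name);

  const out: ExportedState = { groups: {}, setup_keys: {} };

  for (const group of actual.groups) {
    if (isAutoGenerated(group.name)) continue;
    out.groups[group.name] = { peers: [] }; // peer mapping omitted in this sketch
  }

  for (const key of actual.setupKeys) {
    const autoGroups: string[] = [];
    for (const id of key.auto_groups) {
      const name = nameById.get(id);
      if (name !== undefined && !isAutoGenerated(name)) autoGroups.push(name);
    }
    out.setup_keys[key.name] = {
      type: key.type,
      usage_limit: key.usage_limit,
      auto_groups: autoGroups,
      // Assumption added by this sketch: usage_limit 0 means unlimited, never "enrolled".
      enrolled: key.usage_limit > 0 && key.used_times >= key.usage_limit,
    };
  }
  return out;
}
```
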
### 3. No Structural Changes

The reconcile engine (`diff.ts`, `executor.ts`), NetBird client, and state
schema remain unchanged. The export tool and feature flag are additive.

## Ansible Playbook

Location: `poc/ansible/` within this repo.

```
poc/
  ansible/
    inventory.yml
    playbook.yml
    group_vars/
      all/
        vars.yml              # domain, ports, non-secret config
        vault.yml             # secrets (gitignored)
        vault.yml.example     # template for secrets
    templates/
      docker-compose.yml.j2
      management.json.j2      # NetBird management config (embedded IdP)
      Caddyfile.j2
      dashboard.env.j2
      relay.env.j2
      turnserver.conf.j2
      reconciler.env.j2
      gitea.env.j2
```

**Playbook tasks:**

1. Install Docker + Docker Compose (if not present)
2. Create working directory structure
3. Template all config files
4. Pull images, `docker compose up -d`
5. Wait for Gitea to be ready
6. Create Gitea admin user + `BlastPilot` org + `netbird-gitops` repo via API
7. Seed `netbird.json` into the repo with initial test state

**Key config decisions:**

- **Caddy** for reverse proxy (proven in existing PoC templates).
- **Embedded IdP** for NetBird (no external OAuth, same as existing PoC).
- **Secrets auto-generated** at deploy time (NetBird encryption key, TURN
  secret, relay secret). Printed to stdout for operator reference.
- Reconciler env vars templated from `vault.yml` (NetBird API token, Gitea
  token).

**SSH key:** `~/.ssh/hetzner` (same as docs site deployment).

**Deploy command:** `ansible-playbook -i inventory.yml playbook.yml`

## Test netbird.json

The seed state for validation:

```json
{
  "groups": {
    "ground-stations": { "peers": [] },
    "pilots": { "peers": [] }
  },
  "setup_keys": {
    "GS-TestHawk-1": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["ground-stations"],
      "enrolled": false
    },
    "Pilot-TestHawk-1": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["pilots"],
      "enrolled": false
    }
  },
  "policies": {
    "pilots-to-gs": {
      "enabled": true,
      "sources": ["pilots"],
      "destinations": ["ground-stations"],
      "bidirectional": true
    }
  },
  "routes": {},
  "dns": { "nameserver_groups": {} }
}
```

This creates two groups, two one-off setup keys, and a bidirectional policy
between pilots and ground stations. Minimal but sufficient to validate the full
reconcile + enrollment flow.

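Because the seed state refers to groups by name in `auto_groups`, `sources`, and `destinations`, a quick referential check before seeding can catch typos early. The sketch below is an illustration only; `SeedState` and `findDanglingGroupRefs` are hypothetical names, and the real `DesiredStateSchema` may already enforce this.

```typescript
// Hypothetical subset of the netbird.json shape, limited to the fields that reference groups.
interface SeedState {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, { auto_groups: string[] }>;
  policies: Record<string, { sources: string[]; destinations: string[] }>;
}

// Returns one message per group name that is referenced but not declared under "groups".
function findDanglingGroupRefs(state: SeedState): string[] {
  const known = new Set(Object.keys(state.groups));
  const problems: string[] = [];

  for (const [keyName, key] of Object.entries(state.setup_keys)) {
    for (const group of key.auto_groups) {
      if (!known.has(group)) {
        problems.push(`setup_keys.${keyName}: unknown group "${group}"`);
      }
    }
  }

  for (const [policyName, policy] of Object.entries(state.policies)) {
    for (const group of [...policy.sources, ...policy.destinations]) {
      if (!known.has(group)) {
        problems.push(`policies.${policyName}: unknown group "${group}"`);
      }
    }
  }

  return problems;
}
```

For the seed state above, every referenced group is declared, so the function returns an empty list.
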
## Validation Plan

### Phase 1 — Deploy

1. Wipe VPS-A (or just `docker compose down -v` if redeploying).
2. Run playbook → full stack up.
3. Access NetBird dashboard at `https://vps-a.networkmonitor.cc` and verify
   clean state (only default "All" group).
4. Access Gitea at `https://vps-a.networkmonitor.cc/gitea` (or dedicated port)
   and verify `BlastPilot/netbird-gitops` repo exists with seed `netbird.json`.

### Phase 2 — Reconcile

5. `curl -X POST -H 'Content-Type: application/json' -d @netbird.json 'https://vps-a.networkmonitor.cc/reconciler/reconcile?dry_run=true'`
   → Verify plan shows: create 2 groups, 2 setup keys, 1 policy.
6. `curl -X POST -H 'Content-Type: application/json' -d @netbird.json 'https://vps-a.networkmonitor.cc/reconciler/reconcile'`
   → Verify response includes `created_keys` with actual key values.
7. Open NetBird dashboard → verify groups, setup keys, and policy exist.
8. `curl https://vps-a.networkmonitor.cc/reconciler/export` → Compare exported
   state with input. Verify round-trip consistency.

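The comparison in step 8 is easier to automate if both documents are normalized first, since key order and list order can differ between the input file and the export. The following is a minimal sketch under one stated assumption: that array order carries no meaning in `netbird.json` (for example in `auto_groups` or `sources`). If some list turns out to be order-sensitive, the array sort should be dropped for it. `canonicalize` and `sameState` are hypothetical helper names.

```typescript
// Recursively sorts object keys and array elements so that two
// semantically equal documents serialize to the same string.
// Assumption: array order is not significant in netbird.json.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(canonicalize).sort((x, y) => {
      const a = JSON.stringify(x);
      const b = JSON.stringify(y);
      return a < b ? -1 : a > b ? 1 : 0;
    });
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = canonicalize(source[key]);
    }
    return sorted;
  }
  return value; // primitives are returned unchanged
}

// True when both documents are equal after normalization.
function sameState(a: unknown, b: unknown): boolean {
  return JSON.stringify(canonicalize(a)) === JSON.stringify(canonicalize(b));
}
```

Since `GET /export` wraps the document as `{ state, meta }`, the value to compare against the input file would be the `state` field of the response, not the whole body.
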
### Phase 3 — Enrollment

9. Copy a setup key value from step 6 response.
10. On a test machine: `netbird up --setup-key <key>`.
11. Check NetBird dashboard: peer appears, gets auto-renamed by poller, placed
    in correct group.
12. Check reconciler logs: enrollment event detected, peer renamed, log entry
    written (no Gitea commit since `GITEA_ENABLED=false` for initial test).

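The behavior checked in steps 11 and 12 could be modeled as a pure function that turns a batch of polled events into actions. This is a sketch, not the actual `src/poller/loop.ts`: the event shape is a hypothetical simplification (in particular the `setup_key_name` field and numeric, increasing `id` values are assumptions), and renaming the peer to the setup key name is an assumed convention.

```typescript
// Hypothetical, simplified event shape; the real NetBird event payload may differ.
interface ActivityEvent {
  id: number; // assumed to increase monotonically
  activity_code: string;
  target_id: string; // assumed to be the enrolled peer's ID
  setup_key_name: string; // assumed field: name of the key used to enroll
}

type PollerAction =
  | { type: "rename_peer"; peerId: string; newName: string }
  | { type: "commit_enrolled"; setupKey: string }
  | { type: "log_enrollment"; setupKey: string };

interface PollResult {
  actions: PollerAction[];
  lastSeenId: number;
}

function processEvents(
  events: ActivityEvent[],
  lastSeenId: number,
  giteaEnabled: boolean,
): PollResult {
  const actions: PollerAction[] = [];
  let newest = lastSeenId;

  for (const event of events) {
    if (event.id <= lastSeenId) continue; // handled in an earlier poll
    newest = Math.max(newest, event.id);
    if (event.activity_code !== "peer.setupkey.add") continue;

    // Assumed convention: the peer takes the name of its setup key.
    actions.push({
      type: "rename_peer",
      peerId: event.target_id,
      newName: event.setup_key_name,
    });

    // Commit-back happens only with Gitea; otherwise the enrollment is logged.
    if (giteaEnabled) {
      actions.push({ type: "commit_enrolled", setupKey: event.setup_key_name });
    } else {
      actions.push({ type: "log_enrollment", setupKey: event.setup_key_name });
    }
  }

  return { actions, lastSeenId: newest };
}
```

Keeping the decision logic free of I/O makes the `GITEA_ENABLED=false` path testable without a running NetBird or Gitea instance; the loop around it would only fetch events and execute the returned actions.
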
### Phase 4 — State Export (against real instance)

13. Run CLI export against `dev.netbird.achilles-rnd.cc`:

    ```
    deno run src/main.ts --export \
      --netbird-api-url https://dev.netbird.achilles-rnd.cc/api \
      --netbird-api-token <token>
    ```

14. Review the output. This validates that we can bootstrap GitOps from an
    existing environment.
15. Optionally, dry-run reconcile the exported state against the same instance.
    It should produce an empty plan (no changes needed).

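The empty-plan expectation in step 15 (and the "create 2 groups, 2 setup keys, 1 policy" plan from step 5) can be illustrated with a create-only diff by resource name. This is a deliberately reduced sketch with hypothetical names, not the reconciler's `diff.ts`: a real plan would also have to cover updates and deletions before it could be called empty.

```typescript
type ResourceKind = "group" | "setup_key" | "policy";

interface CreateOp {
  action: "create";
  kind: ResourceKind;
  name: string;
}

// Resource names only; attribute-level differences are ignored in this sketch.
interface ResourceNames {
  groups: string[];
  setup_keys: string[];
  policies: string[];
}

// Lists every desired resource whose name does not exist in the actual state.
function planCreates(desired: ResourceNames, actual: ResourceNames): CreateOp[] {
  const ops: CreateOp[] = [];
  const pairs: Array<[ResourceKind, string[], string[]]> = [
    ["group", desired.groups, actual.groups],
    ["setup_key", desired.setup_keys, actual.setup_keys],
    ["policy", desired.policies, actual.policies],
  ];

  for (const [kind, want, have] of pairs) {
    const existing = new Set(have);
    for (const name of want) {
      if (!existing.has(name)) ops.push({ action: "create", kind, name });
    }
  }
  return ops;
}
```

Diffing the seed state against a fresh instance that only has the `All` group yields five create operations, while diffing a state against itself yields none, which is the idempotence property the round-trip test relies on.
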
## Success Criteria

- [ ] Reconcile creates all declared resources in NetBird.
- [ ] Dry-run returns accurate plan without side effects.
- [ ] Export produces valid `netbird.json` from a live instance.
- [ ] Export → dry-run round-trip yields empty plan (idempotent).
- [ ] Poller detects enrollment and renames peer within 30s.
- [ ] Reconciler starts and operates correctly with `GITEA_ENABLED=false`.
- [ ] Reconciler starts and operates correctly with `GITEA_ENABLED=true` +
      Gitea.

## Risks

| Risk | Mitigation |
| ---- | ---------- |
| NetBird Management API behavior differs from docs | Testing against real instance; reconciler has comprehensive error handling |
| Export misses edge cases in resource mapping | Validate with dry-run round-trip (export → reconcile → empty plan) |
| Poller misses events during 30s poll interval | Acceptable for PoC; production can tune interval or add webhook trigger |
| VPS-A resources (2 vCPU, 4GB RAM) insufficient for full stack | Monitor; NetBird + Gitea are lightweight individually |