Reconciler PoC Validation — Design Document

Status: Draft
Author: @prox
Date: 2026-03-06
Depends on: NetBird Reconciler Design

Goal

Validate the reconciler end-to-end on a fresh, isolated NetBird instance before pointing it at production. Prove that:

  1. Declaring state in netbird.json → reconcile → resources appear in NetBird.
  2. Event poller detects peer enrollment and renames the peer.
  3. State export from a live NetBird instance produces a valid netbird.json.

Scope

In scope

  • Deploy a self-contained stack on VPS-A (vps-a.networkmonitor.cc): fresh NetBird, Caddy, Gitea, and reconciler — all via Docker Compose.
  • GITEA_ENABLED feature flag so the reconciler works without Gitea integration.
  • State export tool: GET /export endpoint + --export CLI flag.
  • Core reconcile: groups, setup keys, policies created via /reconcile.
  • Event poller: detect enrollment, rename peer — with or without Gitea commit-back.

Out of scope (deferred)

  • Enrollment pipeline integration (docs site → Gitea PR).
  • CI workflows (dry-run on PR, reconcile on merge).
  • Production deployment to real NetBird environments.
  • Key encryption with age / artifact upload.

Architecture

VPS-A (vps-a.networkmonitor.cc)
├── Caddy (reverse proxy, HTTPS, ACME)
│   ├── /                 → NetBird Dashboard
│   ├── /api              → NetBird Management API
│   ├── /signalexchange   → Signal (gRPC)
│   ├── /relay            → Relay
│   └── /reconciler/*     → Reconciler HTTP API
├── NetBird Management    (config, IdP, API)
├── NetBird Signal        (gRPC peer coordination)
├── NetBird Relay         (data relay for NATed peers)
├── Coturn                (STUN/TURN)
├── Gitea                 (hosts netbird-gitops repo)
└── Reconciler            (reconcile API + event poller)

All containers share a single Docker Compose stack with a common network. Caddy terminates TLS and routes by path prefix.

Changes to Reconciler

1. Feature Flag: GITEA_ENABLED

New environment variable. Default: true (backward compatible).

When GITEA_ENABLED=false:

| Component | Behavior |
|---|---|
| Config validation | Skip GITEA_* env var requirements |
| Startup | Don't create Gitea client |
| POST /reconcile | Works normally: accepts netbird.json from request body, applies to NetBird API |
| Event poller | Still runs. Detects peer.setupkey.add events, renames peers. Skips commit-back of enrolled: true. Logs enrollment instead. |
| GET /export | Works normally: no Gitea dependency |

When GITEA_ENABLED=true: Current behavior, unchanged.

Affected files:

  • src/config.ts — conditional Gitea env var validation
  • src/main.ts — conditional Gitea client creation, pass flag to poller
  • src/poller/loop.ts — guard commit-back behind flag
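
The conditional validation could look roughly like the sketch below. It is an illustration under assumptions, not the contents of src/config.ts: only GITEA_ENABLED is named in this document, so GITEA_URL, GITEA_TOKEN, GITEA_REPO, and the ReconcilerConfig shape are hypothetical placeholders for whatever the GITEA_* variables actually are.

```typescript
// Sketch only. Variable names other than GITEA_ENABLED are assumptions.
interface GiteaConfig {
  url: string;
  token: string;
  repo: string;
}

interface ReconcilerConfig {
  giteaEnabled: boolean;
  gitea?: GiteaConfig; // present only when giteaEnabled is true
}

type Env = Record<string, string | undefined>;

function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function loadConfig(env: Env): ReconcilerConfig {
  // Default is true, so existing deployments keep their current behavior.
  const giteaEnabled =
    (env["GITEA_ENABLED"] ?? "true").toLowerCase() !== "false";

  if (!giteaEnabled) {
    // Skip every GITEA_* requirement; no Gitea client will be created.
    return { giteaEnabled: false };
  }

  return {
    giteaEnabled: true,
    gitea: {
      url: requireVar(env, "GITEA_URL"),
      token: requireVar(env, "GITEA_TOKEN"),
      repo: requireVar(env, "GITEA_REPO"),
    },
  };
}
```

Taking the environment as a plain record instead of reading it globally keeps the function easy to unit-test for both flag values.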

2. State Export

New module: src/export.ts

Transforms ActualState (from src/state/actual.ts) into a valid netbird.json conforming to DesiredStateSchema.

Mapping logic:

| NetBird resource | Export strategy |
|---|---|
| Groups | Map ID → name. Skip auto-generated groups (All, ch- prefixed). Peer refs mapped to setup key names where possible, otherwise peer hostname. |
| Setup keys | Export with current config. Set enrolled: true if used_times >= usage_limit, else false. |
| Policies | Map source/destination group IDs → names. Include port rules. |
| Routes | Map group IDs → names, include network CIDRs. |
| DNS nameserver groups | Map group refs → names. |
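
A minimal sketch of the group, setup key, and policy mappings follows. The input shapes are hypothetical simplifications (the real ActualState and DesiredStateSchema are not shown in this document), peers are assumed to arrive already resolved to names, and routes and DNS are left out. One deliberate deviation is labeled in the code: it assumes a usage_limit of 0 means an unlimited key, which would otherwise always satisfy used_times >= usage_limit.

```typescript
// Hypothetical, simplified shapes. The real ActualState (src/state/actual.ts)
// and DesiredStateSchema may carry more fields than shown here.
interface ActualGroup { id: string; name: string; peers: string[] }
interface ActualSetupKey {
  name: string;
  type: string;
  usage_limit: number;
  used_times: number;
  auto_groups: string[]; // group IDs
}
interface ActualPolicy {
  name: string;
  enabled: boolean;
  sources: string[]; // group IDs
  destinations: string[]; // group IDs
  bidirectional: boolean;
}
interface ExportedState {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, {
    type: string; usage_limit: number; auto_groups: string[]; enrolled: boolean;
  }>;
  policies: Record<string, {
    enabled: boolean; sources: string[]; destinations: string[]; bidirectional: boolean;
  }>;
}

// Groups the table says to skip: the built-in "All" group and "ch-" prefixed ones.
function isAutoGenerated(groupName: string): boolean {
  return groupName === "All" || groupName.startsWith("ch-");
}

function exportState(
  groups: ActualGroup[],
  keys: ActualSetupKey[],
  policies: ActualPolicy[],
): ExportedState {
  const nameById = new Map<string, string>();
  for (const g of groups) nameById.set(g.id, g.name);

  // Fail loudly rather than write a dangling ID into netbird.json.
  const toName = (id: string): string => {
    const name = nameById.get(id);
    if (name === undefined) throw new Error(`Unknown group ID: ${id}`);
    return name;
  };

  const out: ExportedState = { groups: {}, setup_keys: {}, policies: {} };
  for (const g of groups) {
    if (!isAutoGenerated(g.name)) out.groups[g.name] = { peers: [...g.peers] };
  }
  for (const k of keys) {
    out.setup_keys[k.name] = {
      type: k.type,
      usage_limit: k.usage_limit,
      auto_groups: k.auto_groups.map(toName),
      // Assumption: usage_limit 0 means "unlimited", so such keys never count as enrolled.
      enrolled: k.usage_limit > 0 && k.used_times >= k.usage_limit,
    };
  }
  for (const p of policies) {
    out.policies[p.name] = {
      enabled: p.enabled,
      sources: p.sources.map(toName),
      destinations: p.destinations.map(toName),
      bidirectional: p.bidirectional,
    };
  }
  return out;
}
```

Skipped groups stay in the ID lookup, so a policy that references "All" still exports by name even though "All" itself is not emitted under groups.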

Interfaces:

GET /export
  → 200: { state: <netbird.json content>, meta: { exported_at, source_url, groups_count, ... } }

CLI: deno run src/main.ts --export --netbird-api-url <url> --netbird-api-token <token>
  → stdout: netbird.json content

The CLI mode is standalone — it creates a NetBird client, fetches state, exports, and exits. No HTTP server started.
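
One way the --export switch could be detected before any server starts is sketched below. The function name and return shape are hypothetical; only the three flags come from the command above. In Deno the argument list would come from Deno.args.

```typescript
interface ExportCliOptions {
  apiUrl: string;
  apiToken: string;
}

// Returns export options when --export is present, or null to signal
// that the process should start the HTTP server as usual.
function parseExportArgs(args: string[]): ExportCliOptions | null {
  if (!args.includes("--export")) return null;

  const valueOf = (flag: string): string => {
    const index = args.indexOf(flag);
    const value = index >= 0 ? args[index + 1] : undefined;
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`${flag} requires a value`);
    }
    return value;
  };

  return {
    apiUrl: valueOf("--netbird-api-url"),
    apiToken: valueOf("--netbird-api-token"),
  };
}
```

A caller would branch on the result: non-null means fetch state, print the exported netbird.json to stdout, and exit; null means continue with normal startup.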

Affected files:

  • src/export.ts — new: transformation logic
  • src/server.ts — new endpoint: GET /export
  • src/main.ts — new CLI flag: --export

3. No Structural Changes

The reconcile engine (diff.ts, executor.ts), NetBird client, and state schema remain unchanged. The export tool and feature flag are additive.

Ansible Playbook

Location: poc/ansible/ within this repo.

poc/
  ansible/
    inventory.yml
    playbook.yml
    group_vars/
      all/
        vars.yml            # domain, ports, non-secret config
        vault.yml           # secrets (gitignored)
        vault.yml.example   # template for secrets
    templates/
      docker-compose.yml.j2
      management.json.j2   # NetBird management config (embedded IdP)
      Caddyfile.j2
      dashboard.env.j2
      relay.env.j2
      turnserver.conf.j2
      reconciler.env.j2
      gitea.env.j2

Playbook tasks:

  1. Install Docker + Docker Compose (if not present)
  2. Create working directory structure
  3. Template all config files
  4. Pull images, docker compose up -d
  5. Wait for Gitea to be ready
  6. Create Gitea admin user + BlastPilot org + netbird-gitops repo via API
  7. Seed netbird.json into the repo with initial test state

Key config decisions:

  • Caddy for reverse proxy (proven in existing PoC templates).
  • Embedded IdP for NetBird (no external OAuth — same as existing PoC).
  • Secrets auto-generated at deploy time (NetBird encryption key, TURN secret, relay secret). Printed to stdout for operator reference.
  • Reconciler env vars templated from vault.yml (NetBird API token, Gitea token).

SSH key: ~/.ssh/hetzner (same as docs site deployment).

Deploy command: ansible-playbook -i inventory.yml playbook.yml

Test netbird.json

The seed state for validation:

{
  "groups": {
    "ground-stations": { "peers": [] },
    "pilots": { "peers": [] }
  },
  "setup_keys": {
    "GS-TestHawk-1": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["ground-stations"],
      "enrolled": false
    },
    "Pilot-TestHawk-1": {
      "type": "one-off",
      "expires_in": 604800,
      "usage_limit": 1,
      "auto_groups": ["pilots"],
      "enrolled": false
    }
  },
  "policies": {
    "pilots-to-gs": {
      "enabled": true,
      "sources": ["pilots"],
      "destinations": ["ground-stations"],
      "bidirectional": true
    }
  },
  "routes": {},
  "dns": { "nameserver_groups": {} }
}

This creates two groups, two one-off setup keys, and a bidirectional policy between pilots and ground stations. Minimal but sufficient to validate the full reconcile + enrollment flow.
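
Before posting the seed state, it can help to confirm that every group reference resolves and to note how many creates the dry-run plan should report. The sketch below does both on a trimmed copy of the seed (only the fields that carry group references). The DesiredState type and helper name are hypothetical; the real DesiredStateSchema may already enforce this.

```typescript
// Trimmed, hypothetical view of the desired state: only reference-bearing fields.
interface DesiredState {
  groups: Record<string, { peers: string[] }>;
  setup_keys: Record<string, { auto_groups: string[] }>;
  policies: Record<string, { sources: string[]; destinations: string[] }>;
}

// Lists every setup key or policy that points at a group not declared in the file.
function findDanglingGroupRefs(state: DesiredState): string[] {
  const declared = new Set(Object.keys(state.groups));
  const problems: string[] = [];
  const check = (owner: string, refs: string[]): void => {
    for (const ref of refs) {
      if (!declared.has(ref)) {
        problems.push(`${owner} references undeclared group "${ref}"`);
      }
    }
  };
  for (const [name, key] of Object.entries(state.setup_keys)) {
    check(`setup key ${name}`, key.auto_groups);
  }
  for (const [name, policy] of Object.entries(state.policies)) {
    check(`policy ${name}`, policy.sources);
    check(`policy ${name}`, policy.destinations);
  }
  return problems;
}

const seed: DesiredState = {
  groups: { "ground-stations": { peers: [] }, "pilots": { peers: [] } },
  setup_keys: {
    "GS-TestHawk-1": { auto_groups: ["ground-stations"] },
    "Pilot-TestHawk-1": { auto_groups: ["pilots"] },
  },
  policies: {
    "pilots-to-gs": { sources: ["pilots"], destinations: ["ground-stations"] },
  },
};

// On a fresh instance, every declared resource should show up as a create.
const expectedCreates = {
  groups: Object.keys(seed.groups).length,
  setup_keys: Object.keys(seed.setup_keys).length,
  policies: Object.keys(seed.policies).length,
};
```

This check treats a reference to the built-in "All" group as dangling; whether the reconciler allows such references is not stated here.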

Validation Plan

Phase 1 — Deploy

  1. Wipe VPS-A (or just docker compose down -v if redeploying).
  2. Run playbook → full stack up.
  3. Access NetBird dashboard at https://vps-a.networkmonitor.cc — verify clean state (only default "All" group).
  4. Access Gitea at https://vps-a.networkmonitor.cc/gitea (or dedicated port) — verify BlastPilot/netbird-gitops repo exists with seed netbird.json.

Phase 2 — Reconcile

  1. curl -X POST 'https://vps-a.networkmonitor.cc/reconciler/reconcile?dry_run=true' -d @netbird.json → Verify plan shows: create 2 groups, 2 setup keys, 1 policy.
  2. curl -X POST https://vps-a.networkmonitor.cc/reconciler/reconcile -d @netbird.json → Verify response includes created_keys with actual key values.
  3. Open NetBird dashboard → verify groups, setup keys, and policy exist.
  4. curl https://vps-a.networkmonitor.cc/reconciler/export → Compare exported state with input. Verify round-trip consistency.
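
The round-trip comparison in the last step can be made insensitive to JSON key order with a small helper like the sketch below. It is a generic illustration, not part of the reconciler: object keys are sorted recursively, while array order is compared as-is, so group lists that differ only in ordering would still count as a mismatch.

```typescript
type Json =
  | null
  | boolean
  | number
  | string
  | Json[]
  | { [key: string]: Json };

// Rebuilds the value with object keys in sorted order at every depth.
function canonicalize(value: Json): Json {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = canonicalize(value[key]);
    }
    return sorted;
  }
  return value;
}

// True when both documents hold the same data regardless of key order.
function sameState(a: Json, b: Json): boolean {
  return JSON.stringify(canonicalize(a)) === JSON.stringify(canonicalize(b));
}
```

Here `a` would be the parsed input netbird.json and `b` the `state` field of the GET /export response.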

Phase 3 — Enrollment

  1. Copy a setup key value from created_keys in the Phase 2, step 2 response.
  2. On a test machine: netbird up --setup-key <key>.
  3. Check NetBird dashboard: peer appears, gets auto-renamed by poller, placed in correct group.
  4. Check reconciler logs: enrollment event detected, peer renamed, log entry written (no Gitea commit since GITEA_ENABLED=false for initial test).
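
The poller behavior being checked here can be summarized as a pure decision function, sketched below. The event shape and action names are hypothetical (the raw NetBird activity payload may use different field names), and the sketch assumes the peer is renamed to the name of the setup key it enrolled with, which this document does not state explicitly.

```typescript
// Hypothetical normalized event; the raw NetBird activity payload may differ.
interface EnrollmentEvent {
  activityCode: string;
  peerId: string;
  setupKeyName: string;
}

type PollerAction =
  | { kind: "rename_peer"; peerId: string; newName: string }
  | { kind: "commit_enrolled"; setupKeyName: string }
  | { kind: "log_enrollment"; setupKeyName: string };

// Decides what the poller should do for one event, without performing any I/O.
function actionsForEvent(
  event: EnrollmentEvent,
  giteaEnabled: boolean,
): PollerAction[] {
  // Only setup-key enrollments are of interest; everything else is ignored.
  if (event.activityCode !== "peer.setupkey.add") return [];

  const actions: PollerAction[] = [
    // Assumption: the peer takes the name of the setup key it enrolled with.
    { kind: "rename_peer", peerId: event.peerId, newName: event.setupKeyName },
  ];

  // Commit-back of enrolled: true happens only with Gitea; otherwise just log.
  actions.push(
    giteaEnabled
      ? { kind: "commit_enrolled", setupKeyName: event.setupKeyName }
      : { kind: "log_enrollment", setupKeyName: event.setupKeyName },
  );
  return actions;
}
```

With GITEA_ENABLED=false, an enrollment with GS-TestHawk-1 should yield a rename followed by a log entry, which matches what step 4 looks for in the reconciler logs.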

Phase 4 — State Export (against real instance)

  1. Run CLI export against dev.netbird.achilles-rnd.cc:
    deno run src/main.ts --export \
      --netbird-api-url https://dev.netbird.achilles-rnd.cc/api \
      --netbird-api-token <token>
    
  2. Review output — validates we can bootstrap GitOps from existing environment.
  3. Optionally: dry-run reconcile the exported state against the same instance — should produce an empty plan (no changes needed).

Success Criteria

  • Reconcile creates all declared resources in NetBird.
  • Dry-run returns accurate plan without side effects.
  • Export produces valid netbird.json from a live instance.
  • Export → dry-run round-trip yields empty plan (idempotent).
  • Poller detects enrollment and renames peer within 30s.
  • Reconciler starts and operates correctly with GITEA_ENABLED=false.
  • Reconciler starts and operates correctly with GITEA_ENABLED=true + Gitea.

Risks

| Risk | Mitigation |
|---|---|
| NetBird Management API behavior differs from docs | Testing against real instance; reconciler has comprehensive error handling |
| Export misses edge cases in resource mapping | Validate with dry-run round-trip (export → reconcile → empty plan) |
| Poller misses events during 30s poll interval | Acceptable for PoC; production can tune interval or add webhook trigger |
| VPS-A resources (2 vCPU, 4GB RAM) insufficient for full stack | Monitor; NetBird + Gitea are lightweight individually |