
Diagnosis & Provisioning Repeatability

Atelier's whole promise — declare an app and it just comes up, fully switched, on any fresh stack — is only credible if the platform can prove it on demand. The diagnosis subsystem is that proof: a from-zero discipline that stands up an isolated stack and asserts it self-provisioned with no hand-seeding, a catalog of build-time guards baked into the manifest builder, an in-product diagnostics gate that runs the platform's own checks per declared vertical, and a small fleet of read-only canaries, reconcilers, and healers. This is the "diagnosis" leg of the product: it converts the compound-software claims of every other section into things that fail a run when they regress, rather than things that quietly rot.

| Tool / artifact | Kind | What it proves / does | Anchor |
|---|---|---|---|
| from-zero stack | isolated environment | Stands up own Postgres / ontology / UMS / SpiceDB / Keycloak volumes under a dedicated compose project (occm-from-zero), never touching shared common-infra; teardown wipes only this project | scripts/prove_from_zero.sh, docker-compose.from-zero.yml |
| assert_from_zero | acceptance assertions | Asserts the per-app + pruned authz end-state self-provisioned with no manual reconcile: T1 per-app namespaces carry their rules, T2 admin-bff is pruned, T3 admin-bff keeps service-automation + control-plane rules, T4 cross-app deny-all isolation; tenant plane N1/N2 citizen read/apply rules live per-app | admin-api/scripts/assert_from_zero.py |
| build_sheet_manifest guards | build-time gate | --check exits 1 if the on-disk manifest is stale; _detect_route_collisions (fleet-wide portal-route collision), _detect_type_collisions (fleet-wide same-name entity type), dependency topo-order | admin-api/scripts/build_sheet_manifest.py |
| diagnose_all | fleet diagnostics gate | Runs the admin Diagnostics tab's own run_checks per declared identity.code under the tenant, with a hard exit code (import_sheets && diagnose_all); reports not-provisioned rather than silently skipping | admin-api/scripts/diagnose_all.py |
| empty_canary | blast-radius gate | Read-only: a reserved namespace (_empty_canary) resolves deny-all for every type/principal, and every real per-app namespace is deny-all for types it does not own; trips on any rule leak across namespaces | admin-api/scripts/empty_canary.py |
| resolve_compare / shadow_compare | parity gates | The RESOLVED output and the emitted RULE SETS match per real app across the per-app migration | admin-api/scripts/resolve_compare_per_app.py |
| heal_portal / reconcile_* | idempotent healers | Converge a tenant's portal rows to one clean default site; reconcile nav, role tiers, citizen ACLs, service principals, action parameters | admin-api/scripts/heal_portal.py, reconcile_*.py |
| check_sheet | per-sheet linter | Validates one sheet and (with --applied) reports whether it is provisioned on a tenant | admin-api/scripts/check_sheet.py |
| audit_ / assert_workflow_e2e* | targeted audits | Root-org coverage, citizen UUID keying, and the durable submit→callback→row workflow end-to-end | admin-api/scripts/audit_root_org_coverage.py, assert_workflow_e2e.py |
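
As an illustration of what a fleet-wide collision guard checks, here is a minimal sketch in Python. It is not the code in build_sheet_manifest.py: the function name, the `portal_routes` key, and the sample app codes are assumptions made for the example, and the real sheet schema may declare routes differently.

```python
from collections import defaultdict


def detect_route_collisions(sheets):
    """Return {route: [app codes]} for every portal route claimed by 2+ apps."""
    owners = defaultdict(set)
    for sheet in sheets:
        code = sheet["identity"]["code"]
        # "portal_routes" is a hypothetical key used only for this sketch.
        for route in sheet.get("portal_routes", []):
            owners[route].add(code)
    # A route is a collision only when more than one app declares it.
    return {route: sorted(codes) for route, codes in sorted(owners.items()) if len(codes) > 1}


sheets = [
    {"identity": {"code": "permits"}, "portal_routes": ["/apply", "/status"]},
    {"identity": {"code": "licences"}, "portal_routes": ["/apply"]},
]
collisions = detect_route_collisions(sheets)
print(collisions)  # {'/apply': ['licences', 'permits']}
```

A build gate in this style would exit non-zero whenever the returned mapping is non-empty, which is how a collision becomes a failed build rather than a runtime surprise.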

How it's declared. Diagnosis is not authored per vertical — it is a platform capability that consumes the same declarations every other subsystem reads. diagnose_all defaults its app list to every sheet_*.yaml's identity.code, so the moment a vertical's sheet exists, it is in scope for the gate. The in-product checks (ums namespace, kc client, public placements, orphan types, notification events, create-rule coverage, prepare_create shape, pinned runtime defaults, public_view tenant_id, …) are the cause-side mirror of the admin Diagnostics tab — the same run_checks an operator sees in the UI is what CI runs headless. New build guards are declared once, centrally, in build_sheet_manifest.py (e.g. the public-view-groups-by-enum and placement/uuid guards), not scattered per sheet.
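
The "sheet exists, therefore it is in scope" rule can be sketched as follows. This is a hedged illustration rather than diagnose_all's actual code: the function name is hypothetical, and the sketch takes already-parsed sheet documents so that it needs no YAML library.

```python
from fnmatch import fnmatchcase


def default_app_codes(parsed_sheets):
    """Collect identity.code from every file named sheet_*.yaml.

    parsed_sheets maps a file name to its already-parsed document;
    loading and parsing the files is deliberately left out of the sketch.
    """
    codes = []
    for name, doc in sorted(parsed_sheets.items()):
        if not fnmatchcase(name, "sheet_*.yaml"):
            continue  # not a sheet declaration, so not in scope for the gate
        code = (doc.get("identity") or {}).get("code")
        if code:
            codes.append(code)
    return codes


parsed = {
    "sheet_permits.yaml": {"identity": {"code": "permits"}},
    "sheet_licences.yaml": {"identity": {"code": "licences"}},
    "README.md": {},
}
print(default_app_codes(parsed))  # ['licences', 'permits']
```

Deriving the list from the declarations, instead of keeping a separate registry, means a new vertical cannot be forgotten by the gate.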

How it's provisioned. The from-zero environment layers docker-compose.from-zero.yml over the shared app + common-infra compose files under a dedicated project name, bind-mounting host code so an unmerged fix can be proven before it lands. Teardown enables both the data and ontology profiles (the UMS + SpiceDB stack lives under ontology) so that down -v actually wipes the whole plane; tearing down with --profile data alone left a dirty UMS DB that the next run silently reused, a trap the script header documents. The intended CI wiring is import_sheets && diagnose_all: a provision that doesn't pass its own checks fails the run.
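
The teardown shape described above can be sketched as a command builder. This is an assumption-laden illustration, not the contents of prove_from_zero.sh: the helper name is hypothetical, and the first two compose file names are placeholders for the shared app and common-infra files, whose real names are not given here.

```python
def teardown_command(project, compose_files):
    """Build a `docker compose ... down -v` argv with both profiles enabled."""
    argv = ["docker", "compose", "-p", project]
    for path in compose_files:
        argv += ["-f", path]  # later files layer over earlier ones
    # Both profiles are enabled so the ontology-profile stack is torn down too.
    for profile in ("data", "ontology"):
        argv += ["--profile", profile]
    return argv + ["down", "-v"]


cmd = teardown_command(
    "occm-from-zero",
    [
        "docker-compose.yml",               # placeholder: shared app file
        "docker-compose.common-infra.yml",  # placeholder: common-infra file
        "docker-compose.from-zero.yml",
    ],
)
print(" ".join(cmd))
```

Scoping the command with `-p occm-from-zero` is what keeps `down -v` confined to this project's resources.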

Extension points (the growth vector).

  • Add a new invariant as a check inside the shared run_checks; it then surfaces in both the admin Diagnostics tab and diagnose_all at once.
  • Add a new build-time authoring guard as a _detect_* function in the manifest builder, and register it in _classify/the build pass.
  • Add a new from-zero acceptance assertion to assert_from_zero's _checks (template-plane always, tenant-plane behind --tenant).
  • Add a new idempotent healer following the heal_portal pattern: dry-run by default, --apply to execute, never touches another tenant's rows, and a no-op when re-run on a clean tenant.
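
The healer contract (dry-run by default, --apply to execute, tenant-scoped, idempotent) might look like the sketch below. It is not heal_portal itself: the in-memory rows, the `is_default` column, and the function names are assumptions chosen to keep the example self-contained, and row ids are assumed unique.

```python
import argparse


def plan_heal(rows, tenant_id):
    """List the flag flips that leave the tenant with exactly one default site."""
    mine = [r for r in rows if r["tenant_id"] == tenant_id]  # tenant-scoped
    if not mine:
        return []
    defaults = [r for r in mine if r["is_default"]]
    keeper = (defaults or mine)[0]["id"]  # keep the first default, else the first row
    return [
        {"id": r["id"], "is_default": r["id"] == keeper}
        for r in mine
        if r["is_default"] != (r["id"] == keeper)
    ]


def heal(rows, tenant_id, apply=False):
    """Return the planned changes; mutate rows only when apply is True."""
    changes = plan_heal(rows, tenant_id)
    if apply:
        by_id = {r["id"]: r for r in rows}
        for change in changes:
            by_id[change["id"]]["is_default"] = change["is_default"]
    return changes


def main(argv, rows):
    parser = argparse.ArgumentParser(description="heal_portal-style sketch")
    parser.add_argument("--tenant", required=True)
    parser.add_argument("--apply", action="store_true", help="execute the plan")
    args = parser.parse_args(argv)
    changes = heal(rows, args.tenant, apply=args.apply)
    mode = "applied" if args.apply else "dry-run"
    print(f"{mode}: {len(changes)} change(s)")
    return 0


rows = [
    {"id": 1, "tenant_id": "a", "is_default": True},
    {"id": 2, "tenant_id": "a", "is_default": True},   # duplicate default
    {"id": 3, "tenant_id": "b", "is_default": False},  # another tenant's row
]
main(["--tenant", "a"], rows)             # dry-run: plans one change, mutates nothing
main(["--tenant", "a", "--apply"], rows)  # converges tenant "a"
main(["--tenant", "a", "--apply"], rows)  # clean tenant: nothing left to do
```

Separating the plan from its application is what makes the dry-run honest: the same plan function drives both modes, so what is reported is exactly what would be executed.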

Invariants.

  • From-zero comes up fully switched. A fresh fork seeds no staff before the assertion, and the boot chain self-provisions per-app rules + admin-bff prune + per-tenant citizen read/apply with no manual reconcile or backfill (assert_from_zero T1–T4, N1–N2).
  • Deny-all is the floor. A namespace that should hold nothing resolves deny-all for every principal; cross-app isolation is gated on every deploy (empty_canary).
  • Healers are convergent and tenant-scoped. Dry-run by default; idempotent; never mutate another tenant's rows.
  • Diagnostics are exit-coded, not advisory. 0 = all green, 1 = a failure/not-provisioned, 2 = stack unreachable — so a regression fails a run instead of hiding behind an on-demand tab.
  • Verify live before reverting. A diagnosis is a hypothesis until reproduced on the real stack; the test suite is left fully green (zero failing tests), with pre-existing failures fixed at root cause, not papered over.
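
The exit-code contract stated in the invariants (0, 1, 2) can be sketched as a small reducer. The status strings and names here are hypothetical; only the three exit values and their meanings come from the text above.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    GREEN = 0        # all checks passed
    FAILED = 1       # at least one failure or not-provisioned app
    UNREACHABLE = 2  # the stack could not be reached at all


def exit_code(results, reachable=True):
    """Reduce per-app statuses to one process exit code.

    results maps an app code to a status string; any status other
    than "pass" (including "not-provisioned") counts as a failure.
    """
    if not reachable:
        return ExitCode.UNREACHABLE
    if any(status != "pass" for status in results.values()):
        return ExitCode.FAILED
    return ExitCode.GREEN


print(int(exit_code({"permits": "pass", "licences": "pass"})))             # 0
print(int(exit_code({"permits": "pass", "licences": "not-provisioned"})))  # 1
print(int(exit_code({}, reachable=False)))                                 # 2
```

Treating not-provisioned as a failure rather than a skip is the point of the contract: a missing app cannot make a run look green.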

Gaps & open edges.

  • The admin-api pytest suite is not in CI (the Django suite skips it), so it rots. The standing mitigation is to run it with -k "not live" down to zero failures and diff the result against clean main in the same bare venv before claiming a regression. That is a discipline, not a gate; wiring this suite into CI is the single highest-leverage hardening.
  • Fork fidelity is the recurring hazard the healers exist to paper over: the fork copies portal_page rows but has no portal FK-rewrite kind, leaking the source tenant's portal id (a cross-tenant FK leak heal_portal converges), and template→tenant pinned UUIDs (e.g. a placement's occurrence_type) survive a fork that should have rewired them. The durable fix is a complete FK-rewrite manifest in the fork plan (see App Provisioning and Durable Execution), not more healers.
  • Several proof seams are dev/boot-path fail-open and only run on demand (diagnose_all, the canaries), so a regression is caught only if someone runs the gate; until import_sheets && diagnose_all is the enforced CI sweep, "it provisioned" and "it provisioned correctly" remain two different claims.
  • The from-zero workflow-e2e leg depends on bridging the shared martha-api onto the isolated network (no per-tenant Martha), so the durable-execution proof is only as isolated as that shared backend allows (see Durable Execution).

Atelier — Platform Specification. Internal canonical reference.