Diagnosis & Provisioning Repeatability
Atelier's whole promise — declare an app and it just comes up, fully switched, on any fresh stack — is only credible if the platform can prove it on demand. The diagnosis subsystem is that proof: a from-zero discipline that stands up an isolated stack and asserts it self-provisioned with no hand-seeding, a catalog of build-time guards baked into the manifest builder, an in-product diagnostics gate that runs the platform's own checks per declared vertical, and a small fleet of read-only canaries, reconcilers, and healers. This is the "diagnosis" leg of the product: it converts the compound-software claims of every other section into things that fail a run when they regress, rather than things that quietly rot.
| Tool / artifact | Kind | What it proves / does | Anchor |
|---|---|---|---|
| from-zero stack | isolated environment | Stands up own Postgres / ontology / UMS / SpiceDB / Keycloak volumes under a dedicated compose project (occm-from-zero), never touching shared common-infra; teardown wipes only this project | scripts/prove_from_zero.sh, docker-compose.from-zero.yml |
| assert_from_zero | acceptance assertions | Asserts the per-app + pruned authz end-state self-provisioned with no manual reconcile: T1 per-app namespaces carry their rules, T2 admin-bff is pruned, T3 admin-bff keeps service-automation + control-plane rules, T4 cross-app deny-all isolation; tenant plane N1/N2 citizen read/apply rules live per-app | admin-api/scripts/assert_from_zero.py |
| build_sheet_manifest guards | build-time gate | --check exits 1 if the on-disk manifest is stale; _detect_route_collisions (fleet-wide portal-route collision), _detect_type_collisions (fleet-wide same-name entity type), dependency topo-order | admin-api/scripts/build_sheet_manifest.py |
| diagnose_all | fleet diagnostics gate | Runs the admin Diagnostics tab's own run_checks per declared identity.code under the tenant, with a hard exit code (import_sheets && diagnose_all); reports not-provisioned rather than silently skipping | admin-api/scripts/diagnose_all.py |
| empty_canary | blast-radius gate | Read-only: a reserved namespace (_empty_canary) resolves deny-all for every type/principal, and every real per-app namespace is deny-all for types it does not own — trips on any rule leak across namespaces | admin-api/scripts/empty_canary.py |
| resolve_compare / shadow_compare | parity gates | The RESOLVED output and the emitted RULE SETS match per real app across the per-app migration | admin-api/scripts/resolve_compare_per_app.py |
| heal_portal / reconcile_* | idempotent healers | Converge a tenant's portal rows to one clean default site; reconcile nav, role tiers, citizen ACLs, service principals, action parameters | admin-api/scripts/heal_portal.py, reconcile_*.py |
| check_sheet | per-sheet linter | Validates one sheet and (with --applied) reports whether it is provisioned on a tenant | admin-api/scripts/check_sheet.py |
| audit_* / assert_workflow_e2e* | targeted audits | Root-org coverage, citizen UUID keying, and the durable submit→callback→row workflow end-to-end | admin-api/scripts/audit_root_org_coverage.py, assert_workflow_e2e.py |
How it's declared. Diagnosis is not authored per vertical — it is a platform capability that consumes the same declarations every other subsystem reads. diagnose_all defaults its app list to every sheet_*.yaml's identity.code, so the moment a vertical's sheet exists, it is in scope for the gate. The in-product checks (ums namespace, kc client, public placements, orphan types, notification events, create-rule coverage, prepare_create shape, pinned runtime defaults, public_view tenant_id, …) are the cause-side mirror of the admin Diagnostics tab — the same run_checks an operator sees in the UI is what CI runs headless. New build guards are declared once, centrally, in build_sheet_manifest.py (e.g. the public-view-groups-by-enum and placement/uuid guards), not scattered per sheet.
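As a rough illustration of how the gate's scope could follow from the declarations alone, the sketch below collects identity.code from sheets that are assumed to be already parsed into dictionaries (YAML loading is left out). The function names, the example codes, and the status strings are hypothetical and are not taken from diagnose_all itself.

```python
from typing import Iterable, Mapping


def default_app_codes(sheets: Iterable[Mapping]) -> list[str]:
    """Collect identity.code from each parsed sheet, in order, without duplicates."""
    codes: list[str] = []
    for sheet in sheets:
        code = (sheet.get("identity") or {}).get("code")
        if code and code not in codes:
            codes.append(code)
    return codes


def scope_report(codes: Iterable[str], provisioned: set[str]) -> dict[str, str]:
    """Give every declared app a status; an unprovisioned app is reported, never skipped."""
    return {c: ("in-scope" if c in provisioned else "not-provisioned") for c in codes}


# Hypothetical parsed sheets standing in for the contents of sheet_*.yaml files.
sheets = [
    {"identity": {"code": "permits"}},
    {"identity": {"code": "licensing"}},
]
report = scope_report(default_app_codes(sheets), provisioned={"permits"})
```

The point of the design is that scope is derived, not configured: adding a sheet is enough to put its app under the gate, and an app that is declared but absent on the tenant shows up as a finding.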
How it's provisioned. The from-zero environment layers docker-compose.from-zero.yml over the shared app + common-infra compose files under a dedicated project name, bind-mounting host code so an un-merged fix can be proven before it lands. Teardown enables both the data and ontology profiles (the UMS + SpiceDB stack lives under ontology) so that down -v actually wipes the whole plane. Running --profile data down alone did not wipe the UMS data, so a later run silently reused a dirty UMS DB, a trap the script header documents. The intended CI wiring is import_sheets && diagnose_all: a provision that doesn't pass its own checks fails the run.
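The teardown rule can be sketched as a small command builder. This is a minimal sketch, not the contents of prove_from_zero.sh: the base compose file names are placeholders (only docker-compose.from-zero.yml and the occm-from-zero project name come from this section), and the helper name is hypothetical. The docker compose flags used (-p, -f, --profile, down -v) are standard.

```python
from typing import Sequence

PROJECT = "occm-from-zero"  # dedicated compose project named in this section


def teardown_argv(
    compose_files: Sequence[str],
    profiles: Sequence[str] = ("data", "ontology"),
) -> list[str]:
    """Build a `docker compose ... down -v` command that enables every profile,
    so volumes owned by profiled services are removed too."""
    argv = ["docker", "compose", "-p", PROJECT]
    for path in compose_files:
        argv += ["-f", path]
    for profile in profiles:
        argv += ["--profile", profile]
    return argv + ["down", "-v"]


# Placeholder base files; the from-zero overlay goes last so it wins on merge.
cmd = teardown_argv([
    "docker-compose.yml",               # hypothetical shared app file
    "docker-compose.common-infra.yml",  # hypothetical common-infra file
    "docker-compose.from-zero.yml",
])
```

Passing the result to subprocess.run(cmd, check=True) would execute it. Scoping the command with -p is what keeps the wipe confined to this project's volumes.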
Extension points (the growth vector). Add a new invariant as a check inside the shared run_checks (surfaces in both the admin Diagnostics tab and diagnose_all at once). Add a new build-time authoring guard as a _detect_* function in the manifest builder and register it in _classify/the build pass. Add a new from-zero acceptance assertion to assert_from_zero's _checks (template-plane always, tenant-plane behind --tenant). Add a new idempotent healer following the heal_portal pattern: dry-run by default, --apply to execute, never touches another tenant's rows, re-running on a clean tenant is a no-op.
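The healer contract described above (dry-run by default, --apply to execute, tenant-scoped, no-op on a clean tenant) might look roughly like this. It is a sketch over in-memory rows: the PortalRow shape and the convergence rule (keep the alphabetically first site as the default) are invented for illustration and do not describe heal_portal's actual logic.

```python
import argparse
from dataclasses import dataclass


@dataclass
class PortalRow:  # hypothetical row shape
    tenant: str
    site: str
    is_default: bool


def plan_changes(rows, tenant):
    """List (row, desired_is_default) pairs for this tenant only."""
    mine = sorted((r for r in rows if r.tenant == tenant), key=lambda r: r.site)
    changes = []
    for index, row in enumerate(mine):
        want = index == 0  # illustrative rule: first site becomes the single default
        if row.is_default != want:
            changes.append((row, want))
    return changes


def heal(rows, tenant, apply=False):
    """Return the number of planned changes; mutate rows only when apply is True."""
    changes = plan_changes(rows, tenant)
    if apply:
        for row, want in changes:
            row.is_default = want
    return len(changes)


def main(argv=None, rows=None):
    parser = argparse.ArgumentParser(description="hypothetical healer sketch")
    parser.add_argument("--tenant", required=True)
    parser.add_argument("--apply", action="store_true", help="execute; default is dry-run")
    args = parser.parse_args(argv)
    count = heal(rows or [], args.tenant, apply=args.apply)
    print(f"{'applied' if args.apply else 'would apply'} {count} change(s)")
    return 0
```

Because the plan is computed from current state on every run, a second --apply finds nothing to change, which is what makes the healer safe to re-run.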
Invariants.
- From-zero comes up fully switched. A fresh fork seeds no staff before the assertion, and the boot chain self-provisions per-app rules + admin-bff prune + per-tenant citizen read/apply with no manual reconcile or backfill (assert_from_zero T1–T4, N1–N2).
- Deny-all is the floor. A namespace that should hold nothing resolves deny-all for every principal; cross-app isolation is gated on every deploy (empty_canary).
- Healers are convergent and tenant-scoped. Dry-run by default; idempotent; never mutate another tenant's rows.
- Diagnostics are exit-coded, not advisory. 0 = all green, 1 = a failure/not-provisioned, 2 = stack unreachable, so a regression fails a run instead of hiding behind an on-demand tab.
- Verify live before reverting. A diagnosis is a hypothesis until reproduced on the real stack; the test suite is left fully green (zero failing tests), with pre-existing failures fixed at root cause, not papered over.
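The exit-code contract in the list above can be sketched as a single aggregation step. The enum name, the status strings, and the results shape are assumptions made for the example; only the 0/1/2 meanings come from this section.

```python
import enum


class Exit(enum.IntEnum):
    GREEN = 0        # all checks passed
    FAILED = 1       # a check failed, or an app is not provisioned
    UNREACHABLE = 2  # the stack could not be reached at all


def exit_code(results: dict[str, str], reachable: bool = True) -> Exit:
    """Collapse per-app statuses into one process exit code."""
    if not reachable:
        return Exit.UNREACHABLE
    failing = {"fail", "not-provisioned"}  # hypothetical status strings
    if any(status in failing for status in results.values()):
        return Exit.FAILED
    return Exit.GREEN
```

A caller would end with sys.exit(exit_code(results)), which is what lets a shell chain such as import_sheets && diagnose_all stop the run on the first non-zero result.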
Gaps & open edges.
- The admin-api pytest suite is not in CI (the Django suite skips it), so it rots; the standing mitigation is to run -k "not live" to zero failures and diff against clean main in the same bare venv before claiming a regression. That is a discipline, not a gate. Wiring this suite into CI is the single highest-leverage hardening.
- Fork fidelity is the recurring hazard the healers exist to paper over: the fork copies portal_page rows but has no portal FK-rewrite kind, leaking the source tenant's portal id (a cross-tenant FK leak that heal_portal converges), and template→tenant pinned UUIDs (e.g. a placement's occurrence_type) survive a fork that should have rewired them. The durable fix is a complete FK-rewrite manifest in the fork plan (see App Provisioning and Durable Execution), not more healers.
- Several proof seams are dev/boot-path fail-open and only run on demand (diagnose_all, the canaries), so a regression is caught only if someone runs the gate; until import_sheets && diagnose_all is the enforced CI sweep, "it provisioned" and "it provisioned correctly" remain two different claims.
- The from-zero workflow-e2e leg depends on bridging the shared martha-api onto the isolated network (no per-tenant Martha), so the durable-execution proof is only as isolated as that shared backend allows (see Durable Execution).
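The "diff against clean main" discipline from the first gap can be sketched as a set comparison over failing test ids. This assumes the ids have already been collected from two runs in the same environment; the function name and the example ids are hypothetical.

```python
from typing import Iterable


def classify_failures(branch_failed: Iterable[str], main_failed: Iterable[str]) -> dict[str, list[str]]:
    """Split a branch's failing tests into true regressions and failures main already has."""
    branch, main = set(branch_failed), set(main_failed)
    return {
        "regressions": sorted(branch - main),   # introduced by the branch
        "pre_existing": sorted(branch & main),  # already failing on clean main
    }


# Hypothetical test ids for illustration only.
result = classify_failures(
    branch_failed=["test_nav.py::test_order", "test_acl.py::test_citizen"],
    main_failed=["test_acl.py::test_citizen"],
)
```

Only the regressions list supports a regression claim; the pre_existing list still has to reach zero, in line with the invariant that pre-existing failures are fixed at root cause.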