Diagnosis | Atelier

Diagnosis & Provisioning Repeatability

Atelier's whole promise — declare an app and it just comes up, fully switched, on any fresh stack — is only credible if the platform can prove it on demand. The diagnosis subsystem is that proof: a from-zero discipline that stands up an isolated stack and asserts it self-provisioned with no hand-seeding, a catalog of build-time guards baked into the manifest builder, an in-product diagnostics gate that runs the platform's own checks per declared vertical, and a small fleet of read-only canaries, reconcilers, and healers. This is the "diagnosis" leg of the product: it converts the compound-software claims of every other section into things that fail a run when they regress, rather than things that quietly rot.

Tool / artifact	Kind	What it proves / does	Anchor
from-zero stack	isolated environment	Stands up own Postgres / ontology / UMS / SpiceDB / Keycloak volumes under a dedicated compose project (`occm-from-zero`), never touching shared common-infra; teardown wipes only this project	`scripts/prove_from_zero.sh`, `docker-compose.from-zero.yml`
assert_from_zero	acceptance assertions	Asserts the per-app + pruned authz end-state self-provisioned with no manual reconcile: T1 per-app namespaces carry their rules, T2 admin-bff is pruned, T3 admin-bff keeps service-automation + control-plane rules, T4 cross-app deny-all isolation; tenant plane N1/N2 citizen read/apply rules live per-app	`admin-api/scripts/assert_from_zero.py`
build_sheet_manifest guards	build-time gate	`--check` exits 1 if the on-disk manifest is stale; `_detect_route_collisions` (fleet-wide portal-route collision), `_detect_type_collisions` (fleet-wide same-name entity type), dependency topo-order	`admin-api/scripts/build_sheet_manifest.py`
diagnose_all	fleet diagnostics gate	Runs the admin Diagnostics tab's own `run_checks` per declared `identity.code` under the tenant, with a hard exit code (`import_sheets && diagnose_all`); reports `not-provisioned` rather than silently skipping	`admin-api/scripts/diagnose_all.py`
empty_canary	blast-radius gate	Read-only: a reserved namespace (`_empty_canary`) resolves deny-all for every type/principal, and every real per-app namespace is deny-all for types it does not own — trips on any rule leak across namespaces	`admin-api/scripts/empty_canary.py`
resolve_compare / shadow_compare	parity gates	The RESOLVED output and the emitted RULE SETS match per real app across the per-app migration	`admin-api/scripts/resolve_compare_per_app.py`
heal_portal / reconcile_*	idempotent healers	Converge a tenant's portal rows to one clean `default` site; reconcile nav, role tiers, citizen ACLs, service principals, action parameters	`admin-api/scripts/heal_portal.py`, `reconcile_*.py`
check_sheet	per-sheet linter	Validates one sheet and (with `--applied`) reports whether it is provisioned on a tenant	`admin-api/scripts/check_sheet.py`
audit_ / assert_workflow_e2e*	targeted audits	Root-org coverage, citizen UUID keying, and the durable submit→callback→row workflow end-to-end	`admin-api/scripts/audit_root_org_coverage.py`, `assert_workflow_e2e.py`

How it's declared. Diagnosis is not authored per vertical — it is a platform capability that consumes the same declarations every other subsystem reads. diagnose_all defaults its app list to every sheet_*.yaml's identity.code, so the moment a vertical's sheet exists, it is in scope for the gate. The in-product checks (ums namespace, kc client, public placements, orphan types, notification events, create-rule coverage, prepare_create shape, pinned runtime defaults, public_view tenant_id, …) are the cause-side mirror of the admin Diagnostics tab — the same run_checks an operator sees in the UI is what CI runs headless. New build guards are declared once, centrally, in build_sheet_manifest.py (e.g. the public-view-groups-by-enum and placement/uuid guards), not scattered per sheet.

How it's provisioned. The from-zero environment layers docker-compose.from-zero.yml over the shared app + common-infra compose files under a dedicated project name, bind-mounting host code so an un-merged fix can be proven before it lands. Teardown enables both the data and ontology profiles (the UMS + SpiceDB stack lives under ontology) so down -v actually wipes the whole plane — a --profile data down alone silently reused a dirty UMS DB, a trap the script header documents. The intended CI wiring is import_sheets && diagnose_all: a provision that doesn't pass its own checks fails the run.

Extension points (the growth vector). Add a new invariant as a check inside the shared run_checks (surfaces in both the admin Diagnostics tab and diagnose_all at once). Add a new build-time authoring guard as a _detect_* function in the manifest builder and register it in _classify/the build pass. Add a new from-zero acceptance assertion to assert_from_zero's _checks (template-plane always, tenant-plane behind --tenant). Add a new idempotent healer following the heal_portal pattern: dry-run by default, --apply to execute, never touches another tenant's rows, re-running on a clean tenant is a no-op.

Invariants.

From-zero comes up fully switched. A fresh fork seeds no staff before the assertion, and the boot chain self-provisions per-app rules + admin-bff prune + per-tenant citizen read/apply with no manual reconcile or backfill (assert_from_zero T1–T4, N1–N2).
Deny-all is the floor. A namespace that should hold nothing resolves deny-all for every principal; cross-app isolation is gated on every deploy (empty_canary).
Healers are convergent and tenant-scoped. Dry-run by default; idempotent; never mutate another tenant's rows.
Diagnostics are exit-coded, not advisory. 0 = all green, 1 = a failure/not-provisioned, 2 = stack unreachable — so a regression fails a run instead of hiding behind an on-demand tab.
Verify live before reverting. A diagnosis is a hypothesis until reproduced on the real stack; the test suite is left fully green (zero failing tests), with pre-existing failures fixed at root cause, not papered over.

Gaps & open edges.

The admin-api pytest suite is not in CI (the Django suite skips it), so it rots; the standing mitigation is to run -k "not live" to zero failures and diff against clean main in the same bare venv before claiming a regression — a discipline, not a gate. Wiring this suite into CI is the single highest-leverage hardening.
Fork fidelity is the recurring hazard the healers exist to paper over: the fork copies portal_page rows but has no portal FK-rewrite kind, leaking the source tenant's portal id (a cross-tenant FK leak heal_portal converges), and template→tenant pinned UUIDs (e.g. a placement's occurrence_type) survive a fork that should have rewired them. The durable fix is a complete FK-rewrite manifest in the fork plan (see App Provisioning and Durable Execution), not more healers.
Several proof seams are dev/boot-path fail-open and only run on demand (diagnose_all, the canaries), so a regression is caught only if someone runs the gate; until import_sheets && diagnose_all is the enforced CI sweep, "it provisioned" and "it provisioned correctly" remain two different claims.
The from-zero workflow-e2e leg depends on bridging the shared martha-api onto the isolated network (no per-tenant Martha), so the durable-execution proof is only as isolated as that shared backend allows (see Durable Execution).

Diagnosis & Provisioning Repeatability ​

Diagnosis & Provisioning Repeatability