Cold-start and recovery

The demo loses no irreplaceable state because the fixture set is deterministic and the only mutable state is the in-session walkthrough. Production loses no irreplaceable state because every regulated action persists before responding 2xx. This page documents the recovery rules for every cold-start scenario and the failure modes the system is designed to fall into.

useDemoStore (lib/state.ts) persists to localStorage under key lao-demo-state. The partialize function selects which keys persist:

{
  name: "lao-demo-state",
  storage: createJSONStorage(() => localStorage),
  partialize: (state) => ({
    skin: state.skin,
    // Persona, mode, walkthrough step, and live additions all reset
    // every session so the visitor always lands in scripted mode
    // with a clean fixture set.
  }),
}

Persisted across reloads:

  • skin (the active tenant)

Reset on every page load:

  • persona (defaults to principal-compliance-officer)
  • focusedArId (defaults to null)
  • mode (defaults to scripted)
  • walkthroughStep (defaults to 0)
  • personaSwitchSeen (defaults to false)
  • liveBreaches (defaults to empty array)
  • liveMIReturns (defaults to empty array)

The visitor returning to the demo lands on the same skin they last viewed, but with a fresh walkthrough, no in-session writes, and the persona switch confirmation modal armed.
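
The split between persisted and reset keys can be illustrated without Zustand. The sketch below is a stand-in, not the contents of lib/state.ts: `DemoState`, `DEFAULTS`, `rehydrate`, and the skin names are hypothetical, and it stores the bare partialized object, which is a simplification of what a persist layer writes. Only the field names and defaults come from the lists above.

```typescript
// Hypothetical store shape; field names follow the lists above.
interface DemoState {
  skin: string;
  persona: string;
  focusedArId: string | null;
  mode: string;
  walkthroughStep: number;
  personaSwitchSeen: boolean;
  liveBreaches: string[];
  liveMIReturns: string[];
}

const DEFAULTS: DemoState = {
  skin: "default-skin", // assumption: the real default tenant is not named here
  persona: "principal-compliance-officer",
  focusedArId: null,
  mode: "scripted",
  walkthroughStep: 0,
  personaSwitchSeen: false,
  liveBreaches: [],
  liveMIReturns: [],
};

// Mirrors the partialize option: only `skin` is written to storage.
function partialize(state: DemoState): Pick<DemoState, "skin"> {
  return { skin: state.skin };
}

// Simulates a page load: start from defaults, then apply the persisted skin.
function rehydrate(persistedJson: string | null): DemoState {
  const parsed: { skin?: unknown } =
    persistedJson === null ? {} : JSON.parse(persistedJson);
  const skin = typeof parsed.skin === "string" ? parsed.skin : DEFAULTS.skin;
  return { ...DEFAULTS, liveBreaches: [], liveMIReturns: [], skin };
}

// A session on step 6 with one in-session breach, then a reload.
const midSession: DemoState = {
  ...DEFAULTS,
  skin: "tenant-b",
  walkthroughStep: 6,
  liveBreaches: ["b-1"],
};
const stored = JSON.stringify(partialize(midSession));
const reloaded = rehydrate(stored);
```

After the simulated reload, `reloaded.skin` is still `"tenant-b"` while the walkthrough step and the live breach list are back at their defaults.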

The visitor is on step 6 (AR-side MI return form). They refresh.

What happens:

  1. useDemoStore rehydrates. skin survives. Everything else resets.
  2. The route they were on (/demo/ar/mi) still resolves; the surface is independent of walkthrough step.
  3. walkthroughStep is 0; the walkthrough overlay is at step 0 (“Welcome to Oversight”), but the URL is /demo/ar/mi.
  4. The walkthrough-advancer component watches the path and bumps the step floor when the visitor lands on a known surface, so step advances to 6 once the route matches.

The flow continues. No data was lost because nothing the visitor did was a regulated action; the MI return draft is held in the form’s local React state, not the store.
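
The step-floor rule in step 4 can be sketched as a pure function. This is an illustration under assumptions: `STEP_FLOOR_BY_ROUTE` and `advanceForRoute` are hypothetical names, and the table lists only the two route-to-step pairs that appear on this page.

```typescript
// Route-to-step table. Only these two pairs are stated on this page; a real
// table would cover every walkthrough surface.
const STEP_FLOOR_BY_ROUTE: Record<string, number> = {
  "/demo/ar/mi": 6,
  "/demo/principal/breaches": 8,
};

// The route can only raise the step, never lower it.
function advanceForRoute(currentStep: number, pathname: string): number {
  const floor = STEP_FLOOR_BY_ROUTE[pathname];
  return floor === undefined ? currentStep : Math.max(currentStep, floor);
}

// After a refresh the step is 0 but the URL is still /demo/ar/mi.
const stepAfterRefresh = advanceForRoute(0, "/demo/ar/mi");
```

Using a floor rather than an assignment means a visitor who navigates back to an earlier surface does not lose walkthrough progress.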

Browser refresh after filing a breach AR-side

The visitor filed a breach on step 7. They refresh before reaching the principal-side triage queue (step 8).

What happens:

  1. useDemoStore rehydrates. liveBreaches resets to empty.
  2. The walkthrough advances to step 8 once the visitor reaches /demo/principal/breaches.
  3. The triage queue renders fixture breaches only; the in-session breach is gone.

This is acceptable for a demo: the fixture set already contains a breach designed to look fresh on the queue. The walkthrough copy avoids referring to “the breach you just filed” as if continuity is guaranteed.

A “Reset demo” control in the chrome calls useDemoStore.resetWalkthrough():

resetWalkthrough: () =>
  set({ walkthroughStep: 0, mode: "scripted", personaSwitchSeen: false }),

The visitor is bounced to step 0 with mode reset to scripted. liveBreaches and liveMIReturns are not cleared by this action; the visitor can also use the more aggressive “Clear demo data” control which clears them too.
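
The difference between the two controls can be shown as two pure functions over the relevant slice of state. `WalkthroughSlice` and `clearDemoData` are hypothetical names; the page only states that “Clear demo data” also clears the two live arrays.

```typescript
interface WalkthroughSlice {
  walkthroughStep: number;
  mode: string;
  personaSwitchSeen: boolean;
  liveBreaches: string[];
  liveMIReturns: string[];
}

// Same fields as the resetWalkthrough action above; live arrays are untouched.
function resetWalkthrough(s: WalkthroughSlice): WalkthroughSlice {
  return { ...s, walkthroughStep: 0, mode: "scripted", personaSwitchSeen: false };
}

// Assumed behaviour of "Clear demo data": a reset plus empty live arrays.
function clearDemoData(s: WalkthroughSlice): WalkthroughSlice {
  return { ...resetWalkthrough(s), liveBreaches: [], liveMIReturns: [] };
}

const beforeReset: WalkthroughSlice = {
  walkthroughStep: 7,
  mode: "scripted",
  personaSwitchSeen: true,
  liveBreaches: ["b-1"],
  liveMIReturns: [],
};
const afterReset = resetWalkthrough(beforeReset);
const afterClear = clearDemoData(beforeReset);
```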

A visitor with localStorage blocked (Safari Private mode default, some enterprise browser policies):

  1. createJSONStorage(() => localStorage) returns a storage that throws on setItem.
  2. Zustand’s persist middleware swallows the throw and continues with in-memory state.
  3. The visitor experiences the demo correctly within one tab session; closing the tab loses state.

No regression. The demo is designed not to require persistence.
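
For comparison, the same degrade-to-memory behaviour can be made explicit in a small wrapper. This is a sketch of the general technique, not a description of Zustand's internals: `StorageLike` and `withMemoryFallback` are hypothetical names.

```typescript
// Minimal subset of the Web Storage API used by the sketch.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

// Accessing the storage object can itself throw when storage is blocked.
function tryGetStorage(getStorage: () => StorageLike): StorageLike | null {
  try {
    return getStorage();
  } catch {
    return null;
  }
}

// Wraps a storage so that failures fall back to an in-memory map.
function withMemoryFallback(getStorage: () => StorageLike): StorageLike {
  const memory = new Map<string, string>();
  const backing = tryGetStorage(getStorage);
  return {
    getItem(key) {
      if (memory.has(key)) return memory.get(key) ?? null;
      try {
        return backing ? backing.getItem(key) : null;
      } catch {
        return null;
      }
    },
    setItem(key, value) {
      try {
        if (backing) {
          backing.setItem(key, value);
          memory.delete(key);
          return;
        }
      } catch {
        // fall through to the in-memory copy
      }
      memory.set(key, value);
    },
    removeItem(key) {
      memory.delete(key);
      try {
        backing?.removeItem(key);
      } catch {
        // nothing to clean up if the backing store is unavailable
      }
    },
  };
}

// A storage that rejects every write, as a blocked localStorage might.
const blocked: StorageLike = {
  getItem: () => null,
  setItem: () => {
    throw new Error("QuotaExceededError");
  },
  removeItem: () => {},
};
const tolerant = withMemoryFallback(() => blocked);
tolerant.setItem("lao-demo-state", '{"skin":"tenant-b"}');
```

The write above does not throw, and a later `getItem` in the same tab returns the value from memory.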

Every regulated action persists before the API returns a 2xx response. The ordering constraint is:

  1. Validate input (Zod, business rules, state-machine transition).
  2. Open a transaction inside the tenant-scoped middleware (app.tenant_id GUC set).
  3. Update the entity row.
  4. Append the audit event with hash-chain link.
  5. Trigger any side effects (risk recompute, deadline alerter, FCA bundle generation) idempotently.
  6. Commit.
  7. Return 2xx.

A crash at any point before the commit in step 6 completes rolls back the transaction: the client sees a 5xx and retries. Because the entity update and the audit event share one transaction, the audit chain is never written without the entity update, and vice versa.

Side effects that escape the transaction boundary (sending email, generating a PDF) are queued via pgmq (Postgres-backed message queue) and consumed by workers. The queue insert is part of the transaction, so the side effect is durable as soon as the transaction commits.
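
The all-or-nothing property can be demonstrated with an in-memory stand-in for the database. Everything here is illustrative: `Db`, `transaction`, `fileBreach`, the `"Filed"` status, and the message strings are hypothetical, and copying the whole store only imitates what a Postgres transaction provides.

```typescript
interface Db {
  entities: Record<string, string>; // entity id -> status
  audit: string[]; // stand-in for the audit event table
  queue: string[]; // stand-in for pgmq messages
}

// Runs `work` against a copy and swaps it in only if `work` returns,
// imitating commit (swap) and rollback (discard the copy).
function transaction(db: Db, work: (tx: Db) => void): void {
  const tx: Db = {
    entities: { ...db.entities },
    audit: [...db.audit],
    queue: [...db.queue],
  };
  work(tx); // a throw here leaves `db` untouched
  db.entities = tx.entities;
  db.audit = tx.audit;
  db.queue = tx.queue;
}

function fileBreach(db: Db, breachId: string, crashBeforeCommit = false): number {
  if (breachId.length === 0) return 400; // step 1: validate
  try {
    transaction(db, (tx) => {
      tx.entities[breachId] = "Filed"; // step 3: update the entity row
      tx.audit.push(`breach.filed:${breachId}`); // step 4: append the audit event
      tx.queue.push(`notify:${breachId}`); // step 5: enqueue inside the transaction
      if (crashBeforeCommit) throw new Error("simulated crash");
    }); // step 6: commit
  } catch {
    return 500; // the client retries
  }
  return 201; // step 7: respond 2xx only after the commit
}
```

Calling `fileBreach` with `crashBeforeCommit` set leaves all three collections unchanged, which is the property the ordering constraint relies on.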

What happens when the application server crashes or restarts:

  1. In-flight requests fail with a 5xx; clients retry.
  2. Sessions are unaffected (rows in Postgres).
  3. Workers reconnect to pgmq and resume.
  4. The deadline alerter is idempotent: it dedupes by breach id and a “last alerted at” column.
  5. The risk recompute worker is safe to re-run: each recompute writes a new history row keyed by (arId, computedAt); duplicate triggers produce extra rows, which the trajectory query tolerates.

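No data loss, and no duplicated side effects: email and SMS sends are deduplicated via the pgmq visibility timeout, which hides a message from other consumers while one worker is processing it.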
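
The deadline alerter's dedupe rule can be sketched as follows. The page states only that it dedupes by breach id and a “last alerted at” column; the `windowMs` threshold, the `BreachRow` shape, and `alertIfDue` are assumptions made for illustration.

```typescript
interface BreachRow {
  id: string;
  lastAlertedAt: number | null; // epoch ms, null if never alerted
}

// Sends at most one alert per window for a given breach row and stamps the
// row, so a re-run after a restart does nothing.
function alertIfDue(
  row: BreachRow,
  now: number,
  windowMs: number,
  send: (breachId: string) => void,
): boolean {
  const recentlyAlerted =
    row.lastAlertedAt !== null && now - row.lastAlertedAt < windowMs;
  if (recentlyAlerted) return false;
  send(row.id);
  row.lastAlertedAt = now;
  return true;
}

// The worker runs, restarts, and runs again over the same row.
const sentAlerts: string[] = [];
const breachRow: BreachRow = { id: "b-1", lastAlertedAt: null };
const firstRun = alertIfDue(breachRow, 1000, 60000, (id) => sentAlerts.push(id));
const rerunAfterRestart = alertIfDue(breachRow, 2000, 60000, (id) => sentAlerts.push(id));
```

In a real worker the stamp would be written in the same transaction that enqueues the alert, so the two cannot diverge.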

The user was halfway through a file-review workspace, with unsaved findings, when the page refreshed.

What happens:

  1. Inline saving is the default: every finding edit PATCHes /api/reviews/:id with If-Match. By the time the user refreshes, every finding they edited is persisted.
  2. The unsaved finding (the one they were typing into when the refresh fired) is lost.
  3. The review remains in InProgress; the user resumes from where they left off, with the lost finding empty.

The UX cost is one re-typed finding. No regulated action is lost.
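
The If-Match check behind inline saving is a form of optimistic concurrency. The sketch below models it in memory; `Review`, `patchReview`, and the version-based ETag are hypothetical, and the 412 status for a stale precondition follows standard HTTP semantics rather than anything stated on this page.

```typescript
interface Review {
  version: number;
  findings: Record<string, string>; // finding id -> text
}

interface PatchResult {
  status: 200 | 412;
  etag: string;
}

// Assumption: the ETag is derived from a monotonically increasing version.
function etagOf(review: Review): string {
  return `"${review.version}"`;
}

// Applies the edit only if the caller's If-Match equals the current ETag.
function patchReview(
  review: Review,
  ifMatch: string,
  findingId: string,
  text: string,
): PatchResult {
  if (ifMatch !== etagOf(review)) {
    return { status: 412, etag: etagOf(review) }; // stale copy, nothing written
  }
  review.findings[findingId] = text;
  review.version += 1;
  return { status: 200, etag: etagOf(review) };
}

const review: Review = { version: 1, findings: {} };
const saved = patchReview(review, '"1"', "f-1", "Missing suitability report");
const stale = patchReview(review, '"1"', "f-1", "Overwrite from an old tab");
```

The second call carries an outdated ETag, so it is rejected and the first edit is preserved.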

The user’s session expired mid-action.

What happens:

  1. The next request returns 401.
  2. The client redirects to /sign-in?next=<current-path>.
  3. The user re-authenticates; the redirect lands them back on the surface they were on.
  4. Any unsaved form state is preserved by the client’s optimistic-state cache (React Query, with cacheTime, renamed gcTime in v5, set longer than the sign-in flow takes).

For step-up-gated actions (notify FCA, sign off annual review), an expired step-up token returns 403 with error.code === "step_up_required". The UI prompts re-step-up without losing the form.
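
The client-side branching for these two responses can be sketched as a pure function. `AuthAction` and `onAuthFailure` are hypothetical names, and URL-encoding the `next` parameter is an assumption; the status codes, the error code, and the /sign-in path come from the text above.

```typescript
type AuthAction =
  | { kind: "proceed" }
  | { kind: "redirect"; to: string }
  | { kind: "step-up" };

// Decides what the client should do with a response, given its status and
// the error code from the response body (if any).
function onAuthFailure(
  httpStatus: number,
  errorCode: string | undefined,
  currentPath: string,
): AuthAction {
  if (httpStatus === 401) {
    return {
      kind: "redirect",
      to: `/sign-in?next=${encodeURIComponent(currentPath)}`,
    };
  }
  if (httpStatus === 403 && errorCode === "step_up_required") {
    return { kind: "step-up" }; // prompt in place, keep the form mounted
  }
  return { kind: "proceed" };
}

const expiredSession = onAuthFailure(401, undefined, "/reviews/42");
const expiredStepUp = onAuthFailure(403, "step_up_required", "/reviews/42");
```

Returning an action instead of navigating directly keeps the decision testable and lets the step-up prompt render without unmounting the form.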

Hash-chain mismatch (production-only failure mode)

The nightly integrity job recomputes every audit event’s hash and compares against the stored value. A mismatch indicates either a software bug or tampering.

What happens:

  1. The job pages the firm with a P1 incident.
  2. The audit log surface displays a banner: “Integrity check failed at 2026-05-08T03:14:00Z. Records up to that point are verified. Records after that point are under review. Contact support.”
  3. The firm cannot export the chain until the mismatch is resolved (forensic restore from backup, software fix, or both).

This is the safe failure mode: the system makes the failure visible rather than silently continuing.
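
The recompute-and-compare step can be sketched over an in-memory chain. This is a toy: `toyHash` is a 32-bit FNV-1a used only so the example runs without dependencies, and a production chain would use a cryptographic hash. `AuditEvent`, `appendEvent`, `firstMismatch`, and the genesis value are hypothetical.

```typescript
// NOT cryptographic: FNV-1a stands in for a real hash so the sketch is self-contained.
function toyHash(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16);
}

interface AuditEvent {
  payload: string;
  prevHash: string;
  hash: string;
}

const GENESIS = "genesis"; // assumed link value for the first event

function appendEvent(chain: AuditEvent[], payload: string): void {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  chain.push({ payload, prevHash, hash: toyHash(`${prevHash}|${payload}`) });
}

// Returns the index of the first event that fails verification, or -1.
function firstMismatch(chain: AuditEvent[]): number {
  let prevHash = GENESIS;
  return chain.findIndex((event) => {
    const ok =
      event.prevHash === prevHash &&
      event.hash === toyHash(`${prevHash}|${event.payload}`);
    prevHash = event.hash;
    return !ok;
  });
}

const auditChain: AuditEvent[] = [];
appendEvent(auditChain, "breach.filed:b-1");
appendEvent(auditChain, "breach.triaged:b-1");
appendEvent(auditChain, "fca.notified:b-1");
const cleanResult = firstMismatch(auditChain);
```

The returned index is what lets the banner say which records are verified: everything before the first mismatch still checks out.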

GET /api/ars, GET /api/breaches, etc. are idempotent. GET /api/audit uses cursor pagination that ignores writes after the cursor’s at value, so a long-running read is consistent. Two concurrent reads of the same audit page return the same rows.
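
One way to get this consistency is to pin every page to the cursor's `at` value. The sketch below assumes a cursor of `{ at, afterId }` and ids that increase with write order; `AuditRow`, `AuditCursor`, and `readPage` are hypothetical names.

```typescript
interface AuditRow {
  id: number; // assumed to increase with write order
  at: number; // write timestamp, epoch ms
}

interface AuditCursor {
  at: number; // snapshot time fixed when the read started
  afterId: number; // last id already returned
}

// Returns the next page as of cursor.at. Rows written later are ignored, so
// repeated or concurrent reads of the same page return the same rows.
function readPage(
  rows: AuditRow[],
  cursor: AuditCursor,
  limit: number,
): { rows: AuditRow[]; next: AuditCursor } {
  const pageRows = rows
    .filter((r) => r.at <= cursor.at && r.id > cursor.afterId)
    .sort((a, b) => a.id - b.id)
    .slice(0, limit);
  const last = pageRows[pageRows.length - 1];
  return {
    rows: pageRows,
    next: { at: cursor.at, afterId: last ? last.id : cursor.afterId },
  };
}

const auditRows: AuditRow[] = [
  { id: 1, at: 10 },
  { id: 2, at: 20 },
];
const pageOne = readPage(auditRows, { at: 25, afterId: 0 }, 1);
auditRows.push({ id: 3, at: 30 }); // a write lands mid-read
const pageTwo = readPage(auditRows, pageOne.next, 1);
const pageThree = readPage(auditRows, pageTwo.next, 1);
```

The row written at time 30 never appears in this read because the cursor was fixed at 25; a fresh read started later would include it.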

Write routes that could be retried use the request’s Idempotency-Key header (UUID generated client-side) to dedupe. The handler stores (tenantId, idempotencyKey, response) in a 24-hour cache; a duplicate request returns the cached response.
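
A minimal in-memory version of that cache might look like the following. `IdempotencyCache`, `HandlerResponse`, and the composite key format are hypothetical; the (tenantId, idempotencyKey, response) tuple and the 24-hour lifetime come from the paragraph above.

```typescript
interface HandlerResponse {
  statusCode: number;
  body: string;
}

const TTL_MS = 24 * 60 * 60 * 1000; // 24-hour retention

class IdempotencyCache {
  private entries = new Map<string, { response: HandlerResponse; storedAt: number }>();

  // Runs the handler once per (tenantId, key) within the TTL and replays the
  // stored response for duplicates.
  handle(
    tenantId: string,
    idempotencyKey: string,
    now: number,
    run: () => HandlerResponse,
  ): HandlerResponse {
    const cacheKey = `${tenantId}:${idempotencyKey}`;
    const hit = this.entries.get(cacheKey);
    if (hit !== undefined && now - hit.storedAt < TTL_MS) {
      return hit.response;
    }
    const response = run();
    this.entries.set(cacheKey, { response, storedAt: now });
    return response;
  }
}

// A retried request with the same key does not run the handler twice.
let handlerRuns = 0;
const createBreach = (): HandlerResponse => {
  handlerRuns += 1;
  return { statusCode: 201, body: `breach-${handlerRuns}` };
};
const idempotency = new IdempotencyCache();
const original = idempotency.handle("tenant-a", "key-1", 0, createBreach);
const retried = idempotency.handle("tenant-a", "key-1", 5000, createBreach);
```

Scoping the cache key by tenant keeps one tenant's keys from colliding with another's.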

POST /api/mi-returns also enforces uniqueness on (tenantId, arId, period), regardless of the Idempotency-Key header. A second submission for the same period returns 409 with error.code === "period_already_submitted" and a link to the existing return.
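
The period check can be sketched as a lookup before insert. `SubmitResult` and `submitMiReturn` are hypothetical, the existing return is identified by id rather than by a real link, and a production version would rely on a unique constraint rather than a map.

```typescript
type SubmitResult =
  | { statusCode: 201; id: string }
  | { statusCode: 409; errorCode: "period_already_submitted"; existingId: string };

// `returnsByPeriod` stands in for a unique index on (tenantId, arId, period).
function submitMiReturn(
  returnsByPeriod: Map<string, string>,
  tenantId: string,
  arId: string,
  period: string,
  newId: string,
): SubmitResult {
  const uniqueKey = `${tenantId}|${arId}|${period}`;
  const existingId = returnsByPeriod.get(uniqueKey);
  if (existingId !== undefined) {
    return { statusCode: 409, errorCode: "period_already_submitted", existingId };
  }
  returnsByPeriod.set(uniqueKey, newId);
  return { statusCode: 201, id: newId };
}

const miReturns = new Map<string, string>();
const firstSubmit = submitMiReturn(miReturns, "tenant-a", "ar-1", "2026-Q1", "mi-1");
const secondSubmit = submitMiReturn(miReturns, "tenant-a", "ar-1", "2026-Q1", "mi-2");
```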

The walkthrough overlay, persona-switch confirmation modal, and toast notifications all expire on a timer. They never block navigation or input. A visitor who walks away mid-walkthrough returns to a quiet UI; the overlay re-anchors when they next interact.

The risk-score-explainer tooltip and the breach-deadline countdown re-render on a 1-second interval. Both are pure functions of the data they reference; the interval can stop and restart with no state loss.
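
Because the countdown depends only on the deadline and the current time, it can be written as a pure function that any tick can call. `formatCountdown` and the h:mm:ss format are assumptions for illustration.

```typescript
// Pure: the same (deadlineMs, nowMs) pair always yields the same string, so a
// stopped and restarted interval picks up exactly where the clock is.
function formatCountdown(deadlineMs: number, nowMs: number): string {
  const remainingSeconds = Math.floor(Math.max(0, deadlineMs - nowMs) / 1000);
  const hours = Math.floor(remainingSeconds / 3600);
  const minutes = Math.floor((remainingSeconds % 3600) / 60);
  const seconds = remainingSeconds % 60;
  const twoDigits = (n: number): string => (n < 10 ? `0${n}` : `${n}`);
  return `${hours}:${twoDigits(minutes)}:${twoDigits(seconds)}`;
}

const oneHourOneMinuteOneSecond = formatCountdown(3661000, 0);
const pastDeadline = formatCountdown(0, 5000);
```

No elapsed-time counter is kept between ticks, which is why pausing the interval loses nothing.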

A new deployment goes live behind Vercel’s atomic-swap. The previous deployment serves until the new one is healthy. Health checks verify:

  1. GET /api/health returns 200 with database connectivity confirmed.
  2. GET /api/version returns the expected commit SHA.
  3. The migration check confirms every tenant-scoped table has RLS enabled.

A failed health check rolls back automatically. See Production hardening.