
Production hardening

V1 is a marketing demo running on Vercel with no auth, no database, no secrets. This page is the runbook for promoting Oversight from demo to production. None of the items below are required to ship the marketing surface; all are required before a regulated firm submits real AR data.

Work through the items from top to bottom. Each item is gated by the previous one.

Set these variables in the Vercel project’s Production environment. Never check secrets into the repo.

| Variable | Purpose | Rotation |
| --- | --- | --- |
| DATABASE_URL | Postgres connection string. Read replicas configured separately. | Rotate on suspected compromise; otherwise yearly. |
| SESSION_SIGNING_KEY | HMAC-SHA256 key for signing session cookie payloads (defence-in-depth alongside the opaque session id). | Quarterly. Rotate by issuing a new key, accept both old and new for the rotation window, then drop the old. |
| CSRF_SIGNING_KEY | HMAC key for double-submit CSRF tokens. | Quarterly. |
| STEP_UP_SIGNING_KEY | HMAC key for short-lived step-up tokens. | Quarterly. |
| POSTMARK_TOKEN | Email delivery (invitations, password resets, deadline alerts). | Yearly. |
| OBJECT_STORE_BUCKET | Bucket name (e.g. oversight-prod-attachments). | Static. |
| OBJECT_STORE_REGION | Region of the object store (must satisfy data residency, default eu-west-2). | Static. |
| OBJECT_STORE_ACCESS_KEY_ID, OBJECT_STORE_SECRET_ACCESS_KEY | IAM credentials with bucket-prefixed permissions only. | Quarterly. |
| SENTRY_DSN | Error monitoring. | Static; rotate on key compromise. |
| SENTRY_AUTH_TOKEN | Source-map upload during build. | Yearly. |
| RATE_LIMIT_REDIS_URL | Token-bucket store. | Rotate on compromise. |
| FEATURE_FLAGS_API_KEY | (Optional) feature-flag service. | Per provider policy. |
| OPS_PAGER_WEBHOOK | PagerDuty / Opsgenie webhook for P1 incidents. | Yearly. |

Set distinct values for Preview and Development environments. Preview should never connect to the production database.
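The rotation procedure for SESSION_SIGNING_KEY (accept both old and new keys during the rotation window, then drop the old one) can be sketched as below. This is a minimal illustration, not the project's implementation: the helper names `sign` and `verifyWithRotation` and the hex encoding are assumptions; only the HMAC-SHA256 choice comes from the variable table.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: HMAC-SHA256 over the payload, hex-encoded.
function sign(payload: string, key: string): string {
  return createHmac("sha256", key).update(payload).digest("hex");
}

// Accepts a signature produced with any currently trusted key.
// During a rotation window `keys` is [newKey, oldKey]; afterwards just [newKey].
function verifyWithRotation(payload: string, signature: string, keys: string[]): boolean {
  const given = Buffer.from(signature, "hex");
  return keys.some((key) => {
    const expected = Buffer.from(sign(payload, key), "hex");
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return expected.length === given.length && timingSafeEqual(expected, given);
  });
}
```

New cookies would always be signed with the first (newest) key, so once the rotation window exceeds the maximum session lifetime the old key can be removed without logging anyone out.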

Rate limiting is configured at the edge (Vercel middleware) and in the API handlers, backed by a Redis token-bucket store. Per-route limits are listed in API routes. Verify before launch:

  • Auth routes (POST /api/auth/session): 5/min/IP, with 10-fail-per-email-per-hour lockout.
  • Step-up (POST /api/auth/step-up): 10/min.
  • Notify-FCA (POST /api/breaches/:id/notify-fca): 10/min.
  • All other write routes: as documented.

429 responses include Retry-After and never leak account state.
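For readers unfamiliar with token buckets, the following in-memory sketch shows how a limit such as 5/min/IP and the Retry-After value could be derived. It is an assumption-laden illustration: the class name `TokenBucketLimiter` is hypothetical, and the production store is Redis rather than a process-local Map.

```typescript
interface Bucket {
  tokens: number;
  updatedAt: number; // epoch milliseconds of the last refill calculation
}

class TokenBucketLimiter {
  private buckets = new Map<string, Bucket>();
  private capacity: number;
  private windowMs: number;

  // capacity tokens are refilled evenly over windowMs (5 and 60_000 for 5/min).
  constructor(capacity: number, windowMs: number) {
    this.capacity = capacity;
    this.windowMs = windowMs;
  }

  take(key: string, now: number = Date.now()): { allowed: boolean; retryAfterSec: number } {
    const prev = this.buckets.get(key) ?? { tokens: this.capacity, updatedAt: now };
    const refilled = ((now - prev.updatedAt) * this.capacity) / this.windowMs;
    const tokens = Math.min(this.capacity, prev.tokens + refilled);
    if (tokens >= 1) {
      this.buckets.set(key, { tokens: tokens - 1, updatedAt: now });
      return { allowed: true, retryAfterSec: 0 };
    }
    this.buckets.set(key, { tokens, updatedAt: now });
    // Time until one whole token is available, rounded up for the Retry-After header.
    const waitMs = ((1 - tokens) * this.windowMs) / this.capacity;
    return { allowed: false, retryAfterSec: Math.ceil(waitMs / 1000) };
  }
}
```

The denied branch returns only a wait time, which keeps the 429 response free of any account state.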

Set the security headers via next.config.ts:

const securityHeaders = [
  {
    key: "Content-Security-Policy",
    value: [
      "default-src 'self'",
      "script-src 'self' 'unsafe-inline'", // Inline scripts for Next.js bootstrap; tighten with nonces in v2
      "style-src 'self' 'unsafe-inline'",
      "img-src 'self' data: https:",
      "font-src 'self' data:",
      "connect-src 'self' https://o.sentry.io",
      "frame-ancestors 'none'",
      "base-uri 'self'",
      "form-action 'self'",
    ].join("; "),
  },
  { key: "Strict-Transport-Security", value: "max-age=63072000; includeSubDomains; preload" },
  { key: "X-Content-Type-Options", value: "nosniff" },
  { key: "X-Frame-Options", value: "DENY" },
  { key: "Referrer-Policy", value: "strict-origin-when-cross-origin" },
  { key: "Permissions-Policy", value: "camera=(), microphone=(), geolocation=()" },
];

Add nonce-based CSP in v2 once Next.js’s nonce support is stable across the app’s surface.
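The securityHeaders array only takes effect once the Next.js config returns it from its `headers()` function. A minimal sketch of that wiring follows; the array is abbreviated to two entries here, and the catch-all `source` pattern is an assumption about which routes should receive the headers.

```typescript
type Header = { key: string; value: string };

// Abbreviated; in next.config.ts this is the full securityHeaders array.
const securityHeaders: Header[] = [
  { key: "X-Content-Type-Options", value: "nosniff" },
  { key: "X-Frame-Options", value: "DENY" },
];

const nextConfig = {
  // Next.js calls headers() at build time and applies each rule to matching routes.
  async headers() {
    return [{ source: "/(.*)", headers: securityHeaders }];
  },
};

// In next.config.ts, finish with: export default nextConfig;
```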

Sentry on both client and server. Source maps uploaded during build via SENTRY_AUTH_TOKEN. Release tagged with the deployed commit SHA.

// instrumentation.ts (Next.js convention)
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  release: process.env.VERCEL_GIT_COMMIT_SHA,
  tracesSampleRate: 0.1,
  // Strip PII at the edge
  beforeSend: (event) => {
    if (event.request?.cookies) delete event.request.cookies;
    if (event.user?.email) event.user.email = "[redacted]";
    return event;
  },
});

PagerDuty integration on Sentry’s error and fatal levels for the production project. Quiet hours not applied; supervisory failures cannot wait until Monday.

External monitor (Better Stack or Pingdom) hitting:

  • GET /api/health (the production canary endpoint, returns 200 only when Postgres is reachable and RLS migration check passes)
  • GET / (the marketing landing)
  • GET /demo/principal (a representative app surface)

From at least two regions (UK and EU). Page on three consecutive failures (90 seconds). Alert via the OPS_PAGER_WEBHOOK.
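The contract of GET /api/health (200 only when every dependency check passes) can be sketched as a small aggregation function. The function name `healthStatus`, the check names, and the 503 status for the unhealthy case are assumptions made for illustration; the source only specifies when 200 is returned.

```typescript
type Check = () => Promise<boolean>;

interface HealthResult {
  status: number;
  body: Record<string, boolean>;
}

// Runs every check; a check that throws counts as a failure rather than crashing the probe.
async function healthStatus(checks: Record<string, Check>): Promise<HealthResult> {
  const body: Record<string, boolean> = {};
  for (const [name, check] of Object.entries(checks)) {
    try {
      body[name] = await check();
    } catch {
      body[name] = false;
    }
  }
  const healthy = Object.values(body).every(Boolean);
  // 503 is an assumed choice for "not healthy".
  return { status: healthy ? 200 : 503, body };
}
```

A route handler would call this with a Postgres ping and the RLS migration check, then return `status` and `body` unchanged so the external monitor sees which dependency failed.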

Cron (/audit-integrity, daily at 03:00 UTC):

// Pseudocode
async function checkAuditChain(tenantId: Ulid) {
  let prevHash = ZERO_HASH;
  for await (const row of streamAuditEvents(tenantId, { order: "asc" })) {
    // Hash the row without its own stored hash; a digest cannot cover itself.
    const { hash: actual, ...payload } = row;
    const expected = sha256(canonicalJson({ ...payload, prevHash }));
    if (expected !== actual || row.prevHash !== prevHash) {
      pageOps({ severity: "P1", tenantId, atRow: row.id, expected, actual });
      return { ok: false, atRow: row.id };
    }
    prevHash = actual;
  }
  return { ok: true };
}

A mismatch is P1. The audit log surface displays a banner blocking export until the firm acknowledges.

Cron (/retention-sweep, daily at 02:00 UTC). Scans tenant-scoped tables for entities past their retention window (see Persona and tenant model) and:

  1. Deletes object-store attachments for soft-deleted parents.
  2. Hard-deletes the parent row.
  3. Writes a retention.sweep audit event per affected entity.
  4. Runs in transactions of 500 rows or fewer to avoid long locks.

The sweeper does not touch audit_events (10-year retention managed by year-shard partitions, dropped only when a whole year ages out).
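The batching rule for the retention sweep (transactions of 500 rows or fewer) might look like the sketch below. The `SweepOps` interface and `sweep` function are hypothetical names; the database, object store, and audit writer are injected so the example stays self-contained.

```typescript
interface SweepOps<T> {
  deleteAttachments(entity: T): Promise<void>;
  hardDelete(entity: T): Promise<void>;
  writeAudit(entity: T): Promise<void>; // writes the retention.sweep event
  inTransaction(work: () => Promise<void>): Promise<void>;
}

// Processes expired entities in batches so no single transaction holds locks for long.
// Returns the number of transactions used.
async function sweep<T>(expired: readonly T[], ops: SweepOps<T>, batchSize = 500): Promise<number> {
  let batches = 0;
  for (let i = 0; i < expired.length; i += batchSize) {
    const batch = expired.slice(i, i + batchSize);
    await ops.inTransaction(async () => {
      for (const entity of batch) {
        // Same order as the numbered steps: attachments, parent row, audit event.
        await ops.deleteAttachments(entity);
        await ops.hardDelete(entity);
        await ops.writeAudit(entity);
      }
    });
    batches += 1;
  }
  return batches;
}
```

Object-store deletes are not covered by a database rollback, so a real sweeper would need attachment deletion to be idempotent and safe to retry.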

A Subject Access Request handler. Given a userId, the script collects:

  • The User row.
  • All Session rows.
  • All AuditEvent rows where actorUserId = userId.
  • For ar-user: their AR’s BreachReport, FileReview, MIReturn, AnnualReview, ConductEvent rows.

Output is a single ZIP with one CSV per entity plus a README.txt summary. The script runs on demand, not on cron, and writes a gdpr.sar-export audit event.

# pseudo CLI
oversight-admin sar-export \
  --tenant 01HW6V0A... \
  --user 01HW6V8K... \
  --out sar-01HW6V8K.zip

The script must be available in the production environment (containerised, runnable via vercel deploy --prebuilt to a one-off function or as an ops Lambda invoked by the support team).

  • Postgres: daily logical backup, hourly continuous WAL archive, 35-day point-in-time recovery, encrypted at rest, geo-redundant.
  • Object store: versioned bucket, 35-day version retention, replication to a secondary region.
  • Audit chain: included in Postgres backups; integrity check runs against backups quarterly.

Restore drills quarterly: spin up a restore from yesterday’s backup, run the integrity check, verify a known-good audit event is present.

Use pgmigrate or an equivalent migration tool, with these CI steps:

  1. Lint migrations (no destructive operations without a paired up/down).
  2. Apply against a freshly-seeded test database.
  3. Run the RLS check: every tenant-scoped table has ENABLE ROW LEVEL SECURITY, FORCE ROW LEVEL SECURITY, and a tenant_isolation policy.
  4. Refuse the deploy if any check fails.
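Step 3, the RLS check, can be expressed as a pure function over catalog data. In this sketch the `TableRls` shape and `rlsViolations` name are hypothetical; the input is assumed to be built by the CI job from Postgres's `pg_class.relrowsecurity`, `pg_class.relforcerowsecurity`, and the `pg_policies` view.

```typescript
// One entry per tenant-scoped table, populated from the Postgres catalogs.
interface TableRls {
  table: string;
  rlsEnabled: boolean; // pg_class.relrowsecurity
  rlsForced: boolean; // pg_class.relforcerowsecurity
  policies: string[]; // pg_policies.policyname values for this table
}

// Returns a human-readable problem per violation; an empty array means the check passes.
function rlsViolations(tables: TableRls[]): string[] {
  const problems: string[] = [];
  for (const t of tables) {
    if (!t.rlsEnabled) problems.push(`${t.table}: ROW LEVEL SECURITY not enabled`);
    if (!t.rlsForced) problems.push(`${t.table}: ROW LEVEL SECURITY not forced`);
    if (!t.policies.includes("tenant_isolation")) {
      problems.push(`${t.table}: missing tenant_isolation policy`);
    }
  }
  return problems;
}
```

The CI step would exit non-zero whenever the returned array is non-empty, which is what refuses the deploy in step 4.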

Production migrations run on deploy with a 30-second statement timeout. Long migrations (index builds) use CONCURRENTLY and a separate migration window.

Verify the four step-up paths against production keys:

  • POST /api/breaches/:id/notify-fca
  • POST /api/annual-reviews/:id/sign-off
  • PATCH /api/ars/:id (only when transitioning to terminated)
  • PATCH /api/tenants/me/risk-weights

Each path returns 403 with error.code === "step_up_required" when the step-up token is absent or expired. The UI prompts re-step-up without losing form state.
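A guard implementing that response contract could look like the following sketch. The `StepUpToken` shape and the `requireStepUp` name are assumptions, and the token is assumed to have had its signature verified already; only the 403 status and the `error.code` value come from the text above.

```typescript
// Hypothetical decoded token; signature verification is assumed to happen earlier.
interface StepUpToken {
  expiresAt: number; // epoch milliseconds
}

interface StepUpRejection {
  status: 403;
  body: { error: { code: "step_up_required" } };
}

// Returns a rejection to send to the client, or null when the request may proceed.
function requireStepUp(token: StepUpToken | undefined, now: number = Date.now()): StepUpRejection | null {
  if (!token || token.expiresAt <= now) {
    return { status: 403, body: { error: { code: "step_up_required" } } };
  }
  return null;
}
```

Each of the four handlers would call the guard first and return early on a non-null result, which gives the UI one stable code to key its re-step-up prompt on.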

Curl checks in CI:

  • A POST without X-CSRF-Token returns 403.
  • A POST with a stale CSRF token returns 403.
  • A POST after DELETE /api/auth/session returns 401.
  • A request with a cookie from the staging environment never authenticates against production.

Application logs go to Vercel’s runtime log stream and a long-term sink. Logs never contain:

  • Plain-text passwords or TOTP codes
  • Full session ids (truncated to first 8 chars)
  • CSRF tokens or step-up tokens
  • AR-customer PII from breach descriptions or file-review notes (logs reference those rows by id only; the database is the source of truth)

A pre-deploy linter scans for console.log and console.error calls and fails the build if any of them logs a known-PII field.
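As a runtime complement to the linter, a redaction helper can enforce the same rules at the point of logging. This is a sketch under assumptions: the field names in `SECRET_FIELDS` and the `sessionId` key are hypothetical and would need to match the application's actual log schema.

```typescript
// Hypothetical field names for values that must never reach the logs.
const SECRET_FIELDS = new Set(["password", "totpCode", "csrfToken", "stepUpToken"]);

// Drops secret fields entirely and truncates session ids to their first 8 characters.
function redactForLog(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (SECRET_FIELDS.has(key)) continue;
    if (key === "sessionId" && typeof value === "string") {
      out[key] = value.slice(0, 8);
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Routing every structured log call through one helper like this keeps the policy in a single place instead of relying on each call site to remember it.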

Runbooks for:

  • Audit-chain mismatch
  • Database failover
  • Object-store outage
  • Postmark outage (queue emails locally, replay)
  • Vercel platform outage (rollback, status-page comms)

Stored in the company wiki. Linked from the Sentry issue templates so the on-call sees them.

Sign-off required from:

  • Engineering: items 1-6, 10-13
  • Compliance: items 7, 8, 14
  • Operations: items 9, 14

A launch checklist in the company wiki tracks each item against the responsible owner and the verification timestamp. Production traffic is enabled only after every item is green.