Production hardening

V1 is a marketing demo running on Vercel with no auth, no database, no secrets. This page is the runbook for promoting Oversight from demo to production. None of the items below are required to ship the marketing surface; all are required before a regulated firm submits real AR data.

Pre-launch checklist

Tick top to bottom. Every item is gated by the previous one.

1. Environment variables

Set in the Vercel project’s Production environment. Never check secrets into the repo.

Variable	Purpose	Rotation
`DATABASE_URL`	Postgres connection string. Read replicas configured separately.	Rotate on suspected compromise; otherwise yearly.
`SESSION_SIGNING_KEY`	HMAC-SHA256 key for signing session cookie payloads (defence-in-depth alongside the opaque session id).	Quarterly. Rotate by issuing a new key, accepting both old and new keys for the rotation window, then dropping the old one.
`CSRF_SIGNING_KEY`	HMAC key for double-submit CSRF tokens.	Quarterly.
`STEP_UP_SIGNING_KEY`	HMAC key for short-lived step-up tokens.	Quarterly.
`POSTMARK_TOKEN`	Email delivery (invitations, password resets, deadline alerts).	Yearly.
`OBJECT_STORE_BUCKET`	Bucket name (e.g. `oversight-prod-attachments`).	Static.
`OBJECT_STORE_REGION`	Region of the object store (must satisfy data residency, default `eu-west-2`).	Static.
`OBJECT_STORE_ACCESS_KEY_ID`, `OBJECT_STORE_SECRET_ACCESS_KEY`	IAM credentials with bucket-prefixed permissions only.	Quarterly.
`SENTRY_DSN`	Error monitoring.	Static; rotate on key compromise.
`SENTRY_AUTH_TOKEN`	Source-map upload during build.	Yearly.
`RATE_LIMIT_REDIS_URL`	Token-bucket store.	Rotate on compromise.
`FEATURE_FLAGS_API_KEY`	(Optional) feature-flag service.	Per provider policy.
`OPS_PAGER_WEBHOOK`	PagerDuty / Opsgenie webhook for P1 incidents.	Yearly.

Set distinct values for Preview and Development environments. Preview should never connect to the production database.
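
The rotation window described for SESSION_SIGNING_KEY in the table above can be sketched as a verifier that accepts a signature made with either the new or the old key. This is a minimal sketch under stated assumptions, not the project's implementation: `sign`, `verifyWithRotation`, and the way keys are passed in are hypothetical names chosen for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: HMAC-SHA256 of a payload, hex-encoded.
function sign(payload: string, key: string): string {
  return createHmac("sha256", key).update(payload).digest("hex");
}

// Accept a signature made with any key still inside the rotation window.
// `keys` is ordered newest first; remove the old key once the window closes.
function verifyWithRotation(payload: string, signature: string, keys: string[]): boolean {
  const given = Buffer.from(signature, "hex");
  return keys.some((key) => {
    const expected = Buffer.from(sign(payload, key), "hex");
    // timingSafeEqual throws on a length mismatch, so compare lengths first.
    return expected.length === given.length && timingSafeEqual(expected, given);
  });
}
```

New signatures would always be produced with the first (newest) key, so the old key only ever verifies and never signs.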

2. Rate limits

Rate limits are enforced at the edge (Vercel middleware) and again in the API handlers, backed by a Redis token-bucket store. Per-route limits are listed in API routes. Verify before launch:

Auth routes (POST /api/auth/session): 5/min/IP, with a lockout after 10 failed attempts per email per hour.
Step-up (POST /api/auth/step-up): 10/min.
Notify-FCA (POST /api/breaches/:id/notify-fca): 10/min.
All other write routes: as documented.

429 responses include Retry-After and never leak account state.
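
The token-bucket behaviour and the Retry-After value can be sketched as follows. This is an in-memory illustration only; the production store is Redis, and `take`, `Bucket`, and `Decision` are hypothetical names rather than the project's API.

```typescript
// Hypothetical in-memory bucket state; production keeps this in Redis.
interface Bucket { tokens: number; updatedAtMs: number }

interface Decision { allowed: boolean; retryAfterSeconds: number }

// capacity = burst size; refillPerMinute = sustained rate (5/min => 5).
function take(
  buckets: Map<string, Bucket>,
  key: string,
  capacity: number,
  refillPerMinute: number,
  nowMs: number,
): Decision {
  const prev = buckets.get(key) ?? { tokens: capacity, updatedAtMs: nowMs };
  const elapsedMs = nowMs - prev.updatedAtMs;
  // Refill in proportion to elapsed time, never above capacity.
  const tokens = Math.min(capacity, prev.tokens + (elapsedMs * refillPerMinute) / 60000);
  if (tokens >= 1) {
    buckets.set(key, { tokens: tokens - 1, updatedAtMs: nowMs });
    return { allowed: true, retryAfterSeconds: 0 };
  }
  buckets.set(key, { tokens, updatedAtMs: nowMs });
  // Seconds until one whole token is available, rounded up for Retry-After.
  return { allowed: false, retryAfterSeconds: Math.ceil(((1 - tokens) * 60) / refillPerMinute) };
}
```

With a capacity of 5 and a refill of 5 per minute, a sixth immediate request is refused with a Retry-After of 12 seconds. The decision depends only on the bucket key (for example the client IP), so the 429 response carries no information about whether an account exists.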

3. CSP and security headers

Set via next.config.ts:

const securityHeaders = [
  {
    key: "Content-Security-Policy",
    value: [
      "default-src 'self'",
      "script-src 'self' 'unsafe-inline'",  // Inline scripts for Next.js bootstrap; tighten with nonces in v2
      "style-src 'self' 'unsafe-inline'",
      "img-src 'self' data: https:",
      "font-src 'self' data:",
      "connect-src 'self' https://*.sentry.io",  // Must cover the ingest host that appears in SENTRY_DSN
      "frame-ancestors 'none'",
      "base-uri 'self'",
      "form-action 'self'",
    ].join("; "),
  },
  { key: "Strict-Transport-Security", value: "max-age=63072000; includeSubDomains; preload" },
  { key: "X-Content-Type-Options", value: "nosniff" },
  { key: "X-Frame-Options", value: "DENY" },
  { key: "Referrer-Policy", value: "strict-origin-when-cross-origin" },
  { key: "Permissions-Policy", value: "camera=(), microphone=(), geolocation=()" },
];

Add nonce-based CSP in v2 once Next.js’s nonce support is stable across the app’s surface.
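
For completeness, the securityHeaders array above is typically attached to every route through the `headers()` option in next.config.ts. The fragment below is a sketch of that wiring, assuming a Next.js version that supports a TypeScript config file; only one header entry is repeated so the file stands alone.

```typescript
// next.config.ts
import type { NextConfig } from "next";

// Stand-in for the full array defined above; one entry repeated for brevity.
const securityHeaders = [
  { key: "X-Content-Type-Options", value: "nosniff" },
];

const nextConfig: NextConfig = {
  // Apply the same header set to every path served by the app.
  async headers() {
    return [{ source: "/(.*)", headers: securityHeaders }];
  },
};

export default nextConfig;
```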

4. Error monitoring

Sentry on both client and server. Source maps uploaded during build via SENTRY_AUTH_TOKEN. Release tagged with the deployed commit SHA.

// instrumentation.ts (Next.js convention)
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  release: process.env.VERCEL_GIT_COMMIT_SHA,
  tracesSampleRate: 0.1,
  // Strip PII in the SDK, before the event leaves the process
  beforeSend: (event) => {
    if (event.request?.cookies) delete event.request.cookies;
    if (event.user?.email) event.user.email = "[redacted]";
    return event;
  },
});

PagerDuty integration on Sentry’s error and fatal levels for the production project. Quiet hours not applied; supervisory failures cannot wait until Monday.

5. Uptime monitoring

External monitor (Better Stack or Pingdom) hitting:

GET /api/health (the production canary endpoint, returns 200 only when Postgres is reachable and RLS migration check passes)
GET / (the marketing landing)
GET /demo/principal (a representative app surface)

Probe from at least two regions (UK and EU). Page on three consecutive failures (90 seconds at a 30-second check interval). Alert via the OPS_PAGER_WEBHOOK.
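
The paging rule above can be expressed as a small predicate over recent probe results. This is a sketch of the rule only; hosted monitors such as Better Stack or Pingdom implement it in their own configuration, and `shouldPage` is a hypothetical name.

```typescript
// `history` holds one result per check interval, oldest first (true = probe succeeded).
// Page only when the most recent `threshold` probes all failed.
function shouldPage(history: boolean[], threshold = 3): boolean {
  if (history.length < threshold) return false;
  return history.slice(-threshold).every((ok) => !ok);
}
```

Requiring consecutive failures means a single dropped probe does not wake the on-call.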

6. Audit-chain integrity job

Cron (/audit-integrity, daily at 03:00 UTC):

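// Pseudocode: streamAuditEvents, sha256, canonicalJson, and pageOps are defined elsewhere
async function checkAuditChain(tenantId: Ulid) {
  let prevHash = ZERO_HASH;
  for await (const row of streamAuditEvents(tenantId, { order: "asc" })) {
    // Exclude the stored hash from its own input; a row cannot hash itself.
    const { hash: actual, ...payload } = row;
    const expected = sha256(canonicalJson({ ...payload, prevHash }));
    // Check the back-link first, then the recomputed hash.
    if (row.prevHash !== prevHash || expected !== actual) {
      pageOps({ severity: "P1", tenantId, atRow: row.id, expected, actual });
      return { ok: false, atRow: row.id };
    }
    prevHash = actual;
  }
  return { ok: true };
}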

A mismatch is P1. The audit log surface displays a banner blocking export until the firm acknowledges.
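
To make the chain check concrete, here is a self-contained in-memory version that can be run locally. It is a sketch under simplifying assumptions: the row shape (`id`, `action`) is hypothetical, and a fixed-order JSON array stands in for the canonical JSON encoding used by the real job.

```typescript
import { createHash } from "node:crypto";

// Hypothetical, reduced row shape for illustration.
interface AuditRow { id: string; action: string; prevHash: string; hash: string }

const ZERO_HASH = "0".repeat(64);

// Hash the row content together with the previous hash. The row's own
// `hash` field is never part of the input.
function hashRow(id: string, action: string, prevHash: string): string {
  return createHash("sha256").update(JSON.stringify([id, action, prevHash])).digest("hex");
}

// Append a row whose prevHash links to the current tail of the chain.
function appendRow(chain: AuditRow[], id: string, action: string): void {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : ZERO_HASH;
  chain.push({ id, action, prevHash, hash: hashRow(id, action, prevHash) });
}

// Returns the id of the first row that breaks the chain, or null if intact.
function firstBrokenRow(chain: AuditRow[]): string | null {
  let prevHash = ZERO_HASH;
  for (const row of chain) {
    if (row.prevHash !== prevHash || row.hash !== hashRow(row.id, row.action, prevHash)) {
      return row.id;
    }
    prevHash = row.hash;
  }
  return null;
}
```

Editing any field of an earlier row changes its recomputed hash, so the check stops at that row rather than at the end of the chain.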

7. Retention sweeper

Cron (/retention-sweep, daily at 02:00 UTC). Scans tenant-scoped tables for entities past their retention window (see Persona and tenant model) and:

Deletes object-store attachments for soft-deleted parents.
Hard-deletes the parent row.
Writes a retention.sweep audit event per affected entity.
Runs in transactions of 500 rows or fewer to avoid long locks.

The sweeper does not touch audit_events (10-year retention managed by year-shard partitions, dropped only when a whole year ages out).
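
The selection and batching steps of the sweeper can be sketched as two pure functions. The `Expirable` shape, including `deletedAt` and `retentionDays`, is an assumption made for illustration; the actual retention windows are defined in the tenant model.

```typescript
// Hypothetical shape; real retention windows live in the tenant model.
interface Expirable { id: string; deletedAt: Date; retentionDays: number }

const DAY_MS = 24 * 60 * 60 * 1000;

// Entities whose retention window has fully elapsed at `now`.
function pastRetention<T extends Expirable>(rows: T[], now: Date): T[] {
  return rows.filter((r) => now.getTime() - r.deletedAt.getTime() > r.retentionDays * DAY_MS);
}

// Split the work so each transaction touches at most `size` rows.
function batches<T>(rows: T[], size = 500): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}
```

Each batch would then be processed in its own transaction: delete attachments, hard-delete the rows, and write the retention.sweep audit events.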

8. SAR-data export script

A Subject Access Request handler. Given a userId, the script collects:

The User row.
All Session rows.
All AuditEvent rows where actorUserId = userId.
For ar-user: their AR’s BreachReport, FileReview, MIReturn, AnnualReview, ConductEvent rows.

Output is a single ZIP with one CSV per entity plus a README.txt summary. The script runs on demand, not on cron, and writes a gdpr.sar-export audit event.

# pseudo CLI
oversight-admin sar-export \
  --tenant 01HW6V0A... \
  --user   01HW6V8K... \
  --out    sar-01HW6V8K.zip

The script must be available in the production environment (containerised, runnable via vercel deploy --prebuilt to a one-off function or as an ops Lambda invoked by the support team).
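
The one-CSV-per-entity output described above needs careful field escaping, because breach descriptions and notes can contain commas, quotes, and line breaks. The following is a minimal sketch of that step using RFC 4180-style quoting; `csvField` and `toCsv` are hypothetical helpers, not the script's actual functions.

```typescript
// Quote fields containing commas, double quotes, or line breaks; double any embedded quotes.
function csvField(value: unknown): string {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// One CSV document per entity type; columns are taken from the first row.
function toCsv(rows: Record<string, unknown>[]): string {
  if (rows.length === 0) return "";
  const columns = Object.keys(rows[0]);
  const lines = [columns.map(csvField).join(",")];
  for (const row of rows) {
    lines.push(columns.map((column) => csvField(row[column])).join(","));
  }
  return lines.join("\r\n");
}
```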

9. Backups

Postgres: daily logical backup, hourly continuous WAL archive, 35-day point-in-time recovery, encrypted at rest, geo-redundant.
Object store: versioned bucket, 35-day version retention, replication to a secondary region.
Audit chain: included in Postgres backups; integrity check runs against backups quarterly.

Restore drills quarterly: spin up a restore from yesterday’s backup, run the integrity check, verify a known-good audit event is present.

10. Database migrations

pgmigrate or equivalent, with a CI step:

Lint migrations (no destructive operations without a paired up/down).
Apply against a freshly-seeded test database.
Run the RLS check: every tenant-scoped table has ENABLE ROW LEVEL SECURITY, FORCE ROW LEVEL SECURITY, and a tenant_isolation policy.
Refuse the deploy if any check fails.

Production migrations run on deploy with a 30-second statement timeout. Long migrations (index builds) use CONCURRENTLY and a separate migration window.
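
The RLS check in the CI step can be sketched as a pure function over rows read from the Postgres catalogs. The query in the comment and the `RlsStatus` shape are assumptions for illustration (the query ignores schemas for brevity); `relrowsecurity` and `relforcerowsecurity` are columns of `pg_class`, and `pg_policies` is the built-in policy view.

```typescript
// Hypothetical row shape for a catalog query such as:
//   SELECT c.relname, c.relrowsecurity, c.relforcerowsecurity,
//          EXISTS (SELECT 1 FROM pg_policies p
//                  WHERE p.tablename = c.relname
//                    AND p.policyname = 'tenant_isolation') AS has_policy
//   FROM pg_class c
//   WHERE c.relkind = 'r' AND c.relname = ANY($1);
interface RlsStatus {
  relname: string;
  relrowsecurity: boolean;
  relforcerowsecurity: boolean;
  has_policy: boolean;
}

// Returns one readable failure per missing guarantee; an empty array means pass.
function rlsFailures(tables: RlsStatus[]): string[] {
  const failures: string[] = [];
  for (const t of tables) {
    if (!t.relrowsecurity) failures.push(`${t.relname}: ROW LEVEL SECURITY not enabled`);
    if (!t.relforcerowsecurity) failures.push(`${t.relname}: ROW LEVEL SECURITY not forced`);
    if (!t.has_policy) failures.push(`${t.relname}: tenant_isolation policy missing`);
  }
  return failures;
}
```

The CI step would exit non-zero whenever the returned array is non-empty, which refuses the deploy.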

11. Step-up auth wiring

Verify the four step-up paths against production keys:

POST /api/breaches/:id/notify-fca
POST /api/annual-reviews/:id/sign-off
PATCH /api/ars/:id (only when transitioning to terminated)
PATCH /api/tenants/me/risk-weights

Each path returns 403 with error.code === "step_up_required" when the step-up token is absent or expired. The UI prompts re-step-up without losing form state.
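
The response contract above can be sketched as a guard that runs before each protected handler. This assumes the token's signature has already been verified with STEP_UP_SIGNING_KEY; `StepUpToken`, `GuardResult`, and `requireStepUp` are hypothetical names.

```typescript
// Hypothetical decoded token shape; expiresAtMs is epoch milliseconds.
interface StepUpToken { userId: string; expiresAtMs: number }

type GuardResult =
  | { status: 200 }
  | { status: 403; body: { error: { code: "step_up_required" } } };

// Allow the request only if a token is present, belongs to the caller, and is unexpired.
function requireStepUp(token: StepUpToken | null, userId: string, nowMs: number): GuardResult {
  if (token !== null && token.userId === userId && token.expiresAtMs > nowMs) {
    return { status: 200 };
  }
  return { status: 403, body: { error: { code: "step_up_required" } } };
}
```

Absent, expired, and mismatched tokens all produce the same body, so the UI needs only one branch to trigger the re-step-up prompt.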

12. CSRF and session checks

Curl checks in CI:

A POST without X-CSRF-Token returns 403.
A POST with a stale CSRF token returns 403.
A POST after DELETE /api/auth/session returns 401.
A request with a cookie from the staging environment never authenticates against production.

13. Logging and PII discipline

Application logs go to Vercel’s runtime log stream and a long-term sink. Logs never contain:

Plain-text passwords or TOTP codes
Full session ids (log only the first 8 characters)
CSRF tokens or step-up tokens
AR-customer PII from breach descriptions or file-review notes (logs reference those rows by id only; the database is the source of truth)

A pre-deploy linter scans for console.log and console.error calls and fails the build if any of them logs a known-PII field.
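
A linter catches static mistakes; a runtime helper can enforce the same rules on structured log fields. The sketch below is one possible shape, and the field names in `SECRET_FIELDS` and the `sessionId` key are assumptions for illustration only.

```typescript
// Field names treated as secret here are assumptions, not the project's schema.
const SECRET_FIELDS = new Set(["password", "totpCode", "csrfToken", "stepUpToken"]);

// Returns a copy that is safe to log: secrets dropped, session ids cut to 8 characters.
function redactForLog(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (SECRET_FIELDS.has(key)) continue;
    safe[key] = key === "sessionId" && typeof value === "string" ? value.slice(0, 8) : value;
  }
  return safe;
}
```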

14. Operational documentation

Runbooks for:

Audit-chain mismatch
Database failover
Object-store outage
Postmark outage (queue emails locally, replay)
Vercel platform outage (rollback, status-page comms)

Stored in the company wiki. Linked from the Sentry issue templates so the on-call sees them.

Launch gate

Sign-off required from:

Engineering: items 1-6, 10-13
Compliance: items 7, 8, 14
Operations: items 9, 14

A launch checklist in the company wiki tracks each item against the responsible owner and the verification timestamp. Production traffic is enabled only after every item is green.