Production hardening
V1 is a marketing demo running on Vercel with no auth, no database, no secrets. This page is the runbook for promoting Oversight from demo to production. None of the items below are required to ship the marketing surface; all are required before a regulated firm submits real AR data.
Pre-launch checklist
Tick top to bottom. Every item is gated by the previous one.
1. Environment variables
Set in the Vercel project’s Production environment. Never check secrets into the repo.
| Variable | Purpose | Rotation |
|---|---|---|
| `DATABASE_URL` | Postgres connection string. Read replicas configured separately. | Rotate on suspected compromise; otherwise yearly. |
| `SESSION_SIGNING_KEY` | HMAC-SHA256 key for signing session cookie payloads (defence-in-depth alongside the opaque session id). | Quarterly. Rotate by issuing a new key, accepting both old and new for the rotation window, then dropping the old. |
| `CSRF_SIGNING_KEY` | HMAC key for double-submit CSRF tokens. | Quarterly. |
| `STEP_UP_SIGNING_KEY` | HMAC key for short-lived step-up tokens. | Quarterly. |
| `POSTMARK_TOKEN` | Email delivery (invitations, password resets, deadline alerts). | Yearly. |
| `OBJECT_STORE_BUCKET` | Bucket name (e.g. `oversight-prod-attachments`). | Static. |
| `OBJECT_STORE_REGION` | Region of the object store (must satisfy data residency, default `eu-west-2`). | Static. |
| `OBJECT_STORE_ACCESS_KEY_ID`, `OBJECT_STORE_SECRET_ACCESS_KEY` | IAM credentials with bucket-prefixed permissions only. | Quarterly. |
| `SENTRY_DSN` | Error monitoring. | Static; rotate on key compromise. |
| `SENTRY_AUTH_TOKEN` | Source-map upload during build. | Yearly. |
| `RATE_LIMIT_REDIS_URL` | Token-bucket store. | Rotate on compromise. |
| `FEATURE_FLAGS_API_KEY` | (Optional) feature-flag service. | Per provider policy. |
| `OPS_PAGER_WEBHOOK` | PagerDuty / Opsgenie webhook for P1 incidents. | Yearly. |
Set distinct values for Preview and Development environments. Preview should never connect to the production database.
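
The rotation rule for `SESSION_SIGNING_KEY` (issue a new key, accept both during the window, then drop the old) can be sketched as a small key ring. This is a minimal illustration, not the project’s implementation: the `KeyRing` type and the function names are hypothetical, and it assumes the keys are available as plain strings with the newest key first.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical key ring: the first key signs, every key in the ring verifies.
type KeyRing = readonly string[];

function sign(payload: string, key: string): string {
  return createHmac("sha256", key).update(payload).digest("hex");
}

function signWithCurrent(payload: string, keys: KeyRing): string {
  const current = keys[0];
  if (current === undefined) throw new Error("key ring is empty");
  return sign(payload, current);
}

function verifyWithAny(payload: string, signature: string, keys: KeyRing): boolean {
  const given = Buffer.from(signature, "hex");
  return keys.some((key) => {
    const expected = Buffer.from(sign(payload, key), "hex");
    // timingSafeEqual throws on a length mismatch, so compare lengths first.
    return expected.length === given.length && timingSafeEqual(expected, given);
  });
}
```

During the rotation window the ring holds `[newKey, oldKey]`; once the window closes, removing `oldKey` from the ring invalidates every signature made with it.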
2. Rate limits
Rate limits are enforced at the edge (Vercel middleware) and again in the API handlers. The token-bucket store is Redis. Per-route limits are listed in API routes. Verify before launch:
- Auth routes (`POST /api/auth/session`): 5/min/IP, with 10-fail-per-email-per-hour lockout.
- Step-up (`POST /api/auth/step-up`): 10/min.
- Notify-FCA (`POST /api/breaches/:id/notify-fca`): 10/min.
- All other write routes: as documented.

429 responses include `Retry-After` and never leak account state.
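
To make the token-bucket behaviour and the `Retry-After` value concrete, here is a minimal in-memory sketch. It is an assumption-laden illustration: production uses Redis as the store, and the `TokenBucketLimiter` class, its method names and the bucket key format are hypothetical.

```typescript
interface Bucket {
  tokens: number;
  updatedAt: number; // epoch milliseconds of the last update
}

interface LimitResult {
  allowed: boolean;
  retryAfterSeconds: number; // value for the Retry-After header on a 429
}

class TokenBucketLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly msPerToken: number;

  // capacity requests per windowMs, e.g. (5, 60_000) for 5/min.
  constructor(capacity: number, windowMs: number) {
    this.capacity = capacity;
    this.msPerToken = windowMs / capacity;
  }

  take(key: string, nowMs: number): LimitResult {
    const prev = this.buckets.get(key) ?? { tokens: this.capacity, updatedAt: nowMs };
    const elapsedMs = Math.max(0, nowMs - prev.updatedAt);
    // Refill continuously, never above capacity.
    const tokens = Math.min(this.capacity, prev.tokens + elapsedMs / this.msPerToken);
    if (tokens >= 1) {
      this.buckets.set(key, { tokens: tokens - 1, updatedAt: nowMs });
      return { allowed: true, retryAfterSeconds: 0 };
    }
    this.buckets.set(key, { tokens, updatedAt: nowMs });
    const retryAfterSeconds = Math.ceil(((1 - tokens) * this.msPerToken) / 1000);
    return { allowed: false, retryAfterSeconds };
  }
}
```

Keying the bucket by IP alone (for example `ip:203.0.113.7`) keeps the 429 response identical whether or not the submitted email exists, which is one way to avoid leaking account state.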
3. CSP and security headers
Set via `next.config.ts`:

```typescript
import type { NextConfig } from "next";

const securityHeaders = [
  {
    key: "Content-Security-Policy",
    value: [
      "default-src 'self'",
      // Inline scripts for Next.js bootstrap; tighten with nonces in v2.
      "script-src 'self' 'unsafe-inline'",
      "style-src 'self' 'unsafe-inline'",
      "img-src 'self' data: https:",
      "font-src 'self' data:",
      // Must allow the host that the Sentry SDK sends events to.
      "connect-src 'self' https://o.sentry.io",
      "frame-ancestors 'none'",
      "base-uri 'self'",
      "form-action 'self'",
    ].join("; "),
  },
  { key: "Strict-Transport-Security", value: "max-age=63072000; includeSubDomains; preload" },
  { key: "X-Content-Type-Options", value: "nosniff" },
  { key: "X-Frame-Options", value: "DENY" },
  { key: "Referrer-Policy", value: "strict-origin-when-cross-origin" },
  { key: "Permissions-Policy", value: "camera=(), microphone=(), geolocation=()" },
];

// Assumed wiring: apply the headers to every route through the headers() hook.
const nextConfig: NextConfig = {
  async headers() {
    return [{ source: "/(.*)", headers: securityHeaders }];
  },
};

export default nextConfig;
```

Add nonce-based CSP in v2 once Next.js’s nonce support is stable across the app’s surface.
4. Error monitoring
Sentry runs on both client and server. Source maps are uploaded during the build via `SENTRY_AUTH_TOKEN`. Each release is tagged with the deployed commit SHA.

```typescript
// instrumentation.ts (Next.js convention)
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  release: process.env.VERCEL_GIT_COMMIT_SHA,
  tracesSampleRate: 0.1,
  // Strip PII at the edge
  beforeSend: (event) => {
    if (event.request?.cookies) delete event.request.cookies;
    if (event.user?.email) event.user.email = "[redacted]";
    return event;
  },
});
```

PagerDuty integration fires on Sentry’s `error` and `fatal` levels for the production project. Quiet hours are not applied; supervisory failures cannot wait until Monday.
5. Uptime monitoring
External monitor (Better Stack or Pingdom) hitting:

- `GET /api/health` (the production canary endpoint; returns 200 only when Postgres is reachable and the RLS migration check passes)
- `GET /` (the marketing landing)
- `GET /demo/principal` (a representative app surface)

Run the checks from at least two regions (UK and EU). Page on three consecutive failures (90 seconds). Alert via `OPS_PAGER_WEBHOOK`.
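
The "three consecutive failures" rule is normally configured in the monitoring provider rather than written by hand, but the logic is simple enough to sketch. The `FailureStreak` class below is hypothetical and only illustrates the counting rule: a success resets the streak, and the page fires once when the threshold is first reached.

```typescript
class FailureStreak {
  private consecutive = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Records one probe result. Returns true exactly once per outage,
  // at the moment the streak first reaches the threshold.
  record(ok: boolean): boolean {
    if (ok) {
      this.consecutive = 0;
      return false;
    }
    this.consecutive += 1;
    return this.consecutive === this.threshold;
  }
}
```

Returning true only at the threshold, and not on every later failure, avoids re-paging the on-call for the same outage.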
6. Audit-chain integrity job
Cron (`/audit-integrity`, daily at 03:00 UTC):

```typescript
// Pseudocode: Ulid, ZERO_HASH, streamAuditEvents, sha256, canonicalJson and pageOps live elsewhere.
async function checkAuditChain(tenantId: Ulid) {
  let prevHash = ZERO_HASH;
  for await (const row of streamAuditEvents(tenantId, { order: "asc" })) {
    // Hash the row without its own stored hash; a digest cannot cover itself.
    // Assumes the stored hash was computed over the same fields.
    const { hash: actual, ...body } = row;
    const expected = sha256(canonicalJson({ ...body, prevHash }));
    if (expected !== actual || row.prevHash !== prevHash) {
      pageOps({ severity: "P1", tenantId, atRow: row.id, expected, actual });
      return { ok: false, atRow: row.id };
    }
    prevHash = actual;
  }
  return { ok: true };
}
```

A mismatch is P1. The audit log surface displays a banner blocking export until the firm acknowledges.
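
For readers who want to see the chain check run end to end, here is a self-contained sketch over an in-memory array. The `AuditRow` shape (an `id`, an `action`, `prevHash` and `hash`), the all-zero genesis hash and the key-sorted canonical form are all assumptions for illustration; the real schema and canonicalisation may differ.

```typescript
import { createHash } from "node:crypto";

interface AuditBody {
  id: string;
  action: string;
}

interface AuditRow extends AuditBody {
  prevHash: string;
  hash: string;
}

const ZERO_HASH = "0".repeat(64);

// Canonical form: keys sorted so equal fields always serialise identically.
function canonicalJson(value: Record<string, string>): string {
  return JSON.stringify(Object.keys(value).sort().map((key) => [key, value[key]]));
}

function hashRow(body: AuditBody, prevHash: string): string {
  const canonical = canonicalJson({ id: body.id, action: body.action, prevHash });
  return createHash("sha256").update(canonical).digest("hex");
}

function appendRow(chain: readonly AuditRow[], body: AuditBody): AuditRow[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : ZERO_HASH;
  return [...chain, { ...body, prevHash, hash: hashRow(body, prevHash) }];
}

function verifyChain(chain: readonly AuditRow[]): { ok: true } | { ok: false; atRow: string } {
  let prevHash = ZERO_HASH;
  for (const row of chain) {
    const expected = hashRow(row, prevHash);
    if (expected !== row.hash || row.prevHash !== prevHash) {
      return { ok: false, atRow: row.id };
    }
    prevHash = row.hash;
  }
  return { ok: true };
}
```

Because each hash covers the previous one, editing any row without recomputing every later hash is reported at the first row that no longer matches.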
7. Retention sweeper
Cron (`/retention-sweep`, daily at 02:00 UTC). Scans tenant-scoped tables for entities past their retention window (see Persona and tenant model) and:

- Deletes object-store attachments for soft-deleted parents.
- Hard-deletes the parent row.
- Writes a `retention.sweep` audit event per affected entity.
- Runs in transactions of 500 rows or fewer to avoid long locks.

The sweeper does not touch `audit_events` (10-year retention managed by year-shard partitions, dropped only when a whole year ages out).
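
The 500-row batching rule can be sketched independently of the database layer. In this illustration `chunk` and `sweepInBatches` are hypothetical names, and `deleteBatch` stands in for whatever opens a transaction, deletes the rows and writes the audit events.

```typescript
// Splits items into consecutive batches of at most `size` elements.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Processes batches strictly one after another so that only one
// short transaction is open at a time. Returns the number of batches run.
async function sweepInBatches<T>(
  ids: readonly T[],
  deleteBatch: (batch: readonly T[]) => Promise<void>,
  batchSize = 500,
): Promise<number> {
  let completed = 0;
  for (const batch of chunk(ids, batchSize)) {
    await deleteBatch(batch); // assumed to wrap one transaction per batch
    completed += 1;
  }
  return completed;
}
```

If a batch fails, earlier batches stay committed and the next nightly run picks up the remainder, which is usually acceptable for an idempotent sweep.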
8. SAR-data export script
A Subject Access Request handler. Given a `userId`, the script collects:

- The `User` row.
- All `Session` rows.
- All `AuditEvent` rows where `actorUserId = userId`.
- For `ar-user`: their AR’s `BreachReport`, `FileReview`, `MIReturn`, `AnnualReview`, `ConductEvent` rows.

Output is a single ZIP with one CSV per entity plus a `README.txt` summary. The script runs on demand, not on cron, and writes a `gdpr.sar-export` audit event.
```sh
# pseudo CLI
oversight-admin sar-export \
  --tenant 01HW6V0A... \
  --user 01HW6V8K... \
  --out sar-01HW6V8K.zip
```

The script must be available in the production environment (containerised, runnable via `vercel deploy --prebuilt` to a one-off function or as an ops Lambda invoked by the support team).
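
Since the export writes one CSV per entity and breach descriptions or review notes can contain commas, quotes and line breaks, the CSV step needs proper quoting. A minimal sketch, assuming rows arrive as flat records; `toCsv` and `escapeCsvField` are hypothetical helpers, not part of the actual script.

```typescript
type CsvValue = string | number | boolean | null;

// Quotes a field when it contains a comma, quote or line break,
// doubling any embedded quotes.
function escapeCsvField(value: CsvValue): string {
  const text = value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialises rows with a header line; missing columns become empty fields.
function toCsv(
  rows: ReadonlyArray<Record<string, CsvValue>>,
  columns: readonly string[],
): string {
  const header = columns.map(escapeCsvField).join(",");
  const lines = rows.map((row) =>
    columns.map((column) => escapeCsvField(row[column] ?? null)).join(","),
  );
  return [header, ...lines].join("\r\n") + "\r\n";
}
```

Passing an explicit column list keeps the column order stable across exports even when some rows lack optional fields.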
9. Backups
- Postgres: daily logical backup, hourly continuous WAL archive, 35-day point-in-time recovery, encrypted at rest, geo-redundant.
- Object store: versioned bucket, 35-day version retention, replication to a secondary region.
- Audit chain: included in Postgres backups; integrity check runs against backups quarterly.
Restore drills quarterly: spin up a restore from yesterday’s backup, run the integrity check, verify a known-good audit event is present.
10. Database migrations
pgmigrate or equivalent, with a CI step:

- Lint migrations (no destructive operations without a paired up/down).
- Apply against a freshly-seeded test database.
- Run the RLS check: every tenant-scoped table has `ENABLE ROW LEVEL SECURITY`, `FORCE ROW LEVEL SECURITY`, and a `tenant_isolation` policy.
- Refuse the deploy if any check fails.

Production migrations run on deploy with a 30-second statement timeout. Long migrations (index builds) use `CONCURRENTLY` and a separate migration window.
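
The RLS check in the CI step can be expressed as a pure function over catalog data. The sketch below assumes the caller has already queried Postgres (for example `relrowsecurity` and `relforcerowsecurity` from `pg_class`, and policy names from `pg_policies`) and mapped the result into the hypothetical `TableRls` shape.

```typescript
interface TableRls {
  table: string;
  rlsEnabled: boolean; // ENABLE ROW LEVEL SECURITY has been applied
  rlsForced: boolean; // FORCE ROW LEVEL SECURITY has been applied
  policies: readonly string[]; // policy names defined on the table
}

// Returns one message per violated rule; an empty array means the check passes.
function findRlsViolations(
  tables: readonly TableRls[],
  requiredPolicy = "tenant_isolation",
): string[] {
  const violations: string[] = [];
  for (const t of tables) {
    if (!t.rlsEnabled) violations.push(`${t.table}: row level security is not enabled`);
    if (!t.rlsForced) violations.push(`${t.table}: row level security is not forced`);
    if (!t.policies.includes(requiredPolicy)) {
      violations.push(`${t.table}: missing ${requiredPolicy} policy`);
    }
  }
  return violations;
}
```

A CI wrapper would print the messages and exit non-zero when the array is non-empty, which is what refuses the deploy.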
11. Step-up auth wiring
Verify the four step-up paths against production keys:

- `POST /api/breaches/:id/notify-fca`
- `POST /api/annual-reviews/:id/sign-off`
- `PATCH /api/ars/:id` (only when transitioning to `terminated`)
- `PATCH /api/tenants/me/risk-weights`

Each path returns 403 with `error.code === "step_up_required"` when the step-up token is absent or expired. The UI prompts re-step-up without losing form state.
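
The routing rule and the 403 contract above can be sketched as two small functions. Everything here is illustrative: `needsStepUp`, `requireStepUp` and the `StepUpToken` shape are hypothetical, and the sketch assumes the AR transition is expressed as a `status` field in the PATCH body.

```typescript
interface StepUpToken {
  expiresAt: number; // epoch milliseconds
}

type GuardResult =
  | { ok: true }
  | { ok: false; status: 403; body: { error: { code: "step_up_required" } } };

const STEP_UP_ROUTES = [
  { method: "POST", pattern: /^\/api\/breaches\/[^/]+\/notify-fca$/ },
  { method: "POST", pattern: /^\/api\/annual-reviews\/[^/]+\/sign-off$/ },
  { method: "PATCH", pattern: /^\/api\/tenants\/me\/risk-weights$/ },
];

function needsStepUp(method: string, path: string, body?: { status?: string }): boolean {
  // PATCH /api/ars/:id only needs step-up when moving to "terminated".
  if (method === "PATCH" && /^\/api\/ars\/[^/]+$/.test(path)) {
    return body?.status === "terminated";
  }
  return STEP_UP_ROUTES.some((r) => r.method === method && r.pattern.test(path));
}

function requireStepUp(token: StepUpToken | null, nowMs: number): GuardResult {
  if (token !== null && token.expiresAt > nowMs) return { ok: true };
  return { ok: false, status: 403, body: { error: { code: "step_up_required" } } };
}
```

A handler would call `needsStepUp` first and run `requireStepUp` only when it returns true, so ordinary AR edits are not interrupted by a step-up prompt.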
12. CSRF and session-cookie integrity
Curl checks in CI:

- A POST without `X-CSRF-Token` returns 403.
- A POST with a stale CSRF token returns 403.
- A POST after `DELETE /api/auth/session` returns 401.
- A request with a cookie from the staging environment never authenticates against production.
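
The first two checks exercise the server-side double-submit validation. A minimal sketch of one way that validation could work, assuming the token is an HMAC over the session id plus a random nonce; the token format and the `issueCsrfToken` and `csrfStatus` names are assumptions, not the application’s actual scheme.

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

function mac(key: string, sessionId: string, nonce: string): string {
  return createHmac("sha256", key).update(`${sessionId}.${nonce}`).digest("hex");
}

// Token format (assumed): "<nonce>.<hmac over sessionId and nonce>".
function issueCsrfToken(key: string, sessionId: string): string {
  const nonce = randomBytes(16).toString("hex");
  return `${nonce}.${mac(key, sessionId, nonce)}`;
}

// Returns the HTTP status the CSRF layer would produce for a write request.
function csrfStatus(
  key: string,
  sessionId: string,
  cookieToken: string | undefined,
  headerToken: string | undefined,
): 200 | 403 {
  // Double submit: the header must be present and match the cookie.
  if (!cookieToken || !headerToken || cookieToken !== headerToken) return 403;
  const [nonce, given] = headerToken.split(".");
  if (!nonce || !given) return 403;
  const expected = Buffer.from(mac(key, sessionId, nonce), "hex");
  const actual = Buffer.from(given, "hex");
  const valid = expected.length === actual.length && timingSafeEqual(expected, actual);
  return valid ? 200 : 403;
}
```

Binding the token to the session id is what makes a token from an earlier session stale, and signing with an environment-specific key means a staging token does not validate in production.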
13. Logging and PII discipline
Application logs go to Vercel’s runtime log stream and a long-term sink. Logs never contain:
- Plain-text passwords or TOTP codes
- Full session ids (truncated to first 8 chars)
- CSRF tokens or step-up tokens
- AR-customer PII from breach descriptions or file-review notes (those rows are loaded by id only at the log level; the database is the source of truth)
A pre-deploy linter scans for `console.log` and `console.error` calls and fails the build if any of them logs a known-PII field.
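
A linter catches direct calls, but a small redaction helper at the logging boundary gives a second line of defence. The sketch below is illustrative only: the field names in `REDACTED_FIELDS`, the `sessionId` key and the `redactForLog` function are assumptions about how log fields might be named.

```typescript
// Hypothetical field names that must never reach the logs in clear text.
const REDACTED_FIELDS = new Set(["password", "totpCode", "csrfToken", "stepUpToken"]);

// Returns a shallow copy that is safe to log: secrets are replaced and
// session ids are truncated to their first 8 characters.
function redactForLog(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (REDACTED_FIELDS.has(key)) {
      safe[key] = "[redacted]";
    } else if (key === "sessionId" && typeof value === "string") {
      safe[key] = value.slice(0, 8);
    } else {
      safe[key] = value;
    }
  }
  return safe;
}
```

For example, `redactForLog({ sessionId: "0123456789abcdef", breachId: "01HW6V" })` keeps the breach id intact and shortens the session id, which matches the rule of logging rows by id only.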
14. Operational documentation
Runbooks for:
- Audit-chain mismatch
- Database failover
- Object-store outage
- Postmark outage (queue emails locally, replay)
- Vercel platform outage (rollback, status-page comms)
Stored in the company wiki. Linked from the Sentry issue templates so the on-call sees them.
Launch gate
Sign-off required from:
- Engineering: items 1-6, 10-13
- Compliance: items 7, 8, 14
- Operations: items 9, 14
A launch checklist in the company wiki tracks each item against the responsible owner and the verification timestamp. Production traffic is enabled only after every item is green.