schemeta/docs/operations-runbook.md

# Schemeta Operations Runbook

This runbook covers baseline production operation of the Schemeta API and UI.

## Runtime

- Node.js 18+ recommended.
- Start command: `npm run start`
- Default bind: `0.0.0.0:8787`

## Environment Variables

- `PORT` (default `8787`)
- `MAX_BODY_BYTES` (default `2097152`)
  - Hard limit, in bytes, for request body size on `POST` endpoints. The default is 2 MiB.
- `MAX_REQUESTS_PER_MINUTE` (default `120`)
  - Maximum number of `POST` requests allowed per client IP per minute.
- `SCHEMETA_AUTH_TOKEN` (optional)
  - When set, all `POST` API routes require one of the following headers:
    - `Authorization: Bearer <token>`
    - `x-api-key: <token>`
- `CORS_ORIGIN` (optional)
  - If set, CORS is enabled for this origin only.
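
The token rule above can be exercised from a shell. The following is a minimal sketch, not the project's own tooling: the helper names `schemeta_auth_header` and `schemeta_post` are hypothetical, the server is assumed to listen on `localhost`, and the `Content-Type: application/json` header is an assumption about what the API expects.

```shell
# Hypothetical helper: print the Authorization header when a token is
# configured in this shell, and print nothing otherwise.
schemeta_auth_header() {
  if [ -n "${SCHEMETA_AUTH_TOKEN:-}" ]; then
    printf 'Authorization: Bearer %s' "$SCHEMETA_AUTH_TOKEN"
  fi
}

# Hypothetical helper: POST a JSON file ($2) to a route ($1).
# Runs in a subshell so the argument rewriting below does not leak out.
schemeta_post() (
  route="$1"
  file="$2"
  auth="$(schemeta_auth_header)"
  # Assumption: the API accepts application/json request bodies.
  set -- -s -H "Content-Type: application/json" --data-binary "@${file}"
  if [ -n "$auth" ]; then
    set -- "$@" -H "$auth"
  fi
  curl "$@" "http://localhost:${PORT:-8787}${route}"
)
```

For example, `schemeta_post /compile frontend/sample.schemeta.json` sends the sample with the bearer header if `SCHEMETA_AUTH_TOKEN` is set in the calling shell. The `x-api-key: <token>` header is the documented alternative.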

## Endpoints

- `GET /health`
  - Liveness probe; returns process uptime and status.
- `GET /`
  - Serves workspace UI.
- `POST /compile`
  - Compile + render with ERC/diagnostics and layout metrics.
- `POST /analyze`
  - Topology and diagnostics summary.
- `GET /mcp/ui-bundle`
  - Metadata for MCP UI embedding.

## Request Correlation and Audit Logs

- Every response includes `x-request-id`.
- API envelopes include `request_id` for correlation in clients and logs.
- The server emits one JSON audit log entry per request when the response finishes, with these fields:
  - `request_id`
  - `method`
  - `path`
  - `status`
  - `duration_ms`
  - `client`
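
To correlate a client-side response with its audit entry, read `x-request-id` from the response headers and search the logs for that value. A minimal sketch, assuming the audit entries end up in a file (the name `schemeta.log` is hypothetical; use wherever your process manager writes stdout) and using a hypothetical helper `extract_request_id`:

```shell
# Hypothetical helper: read raw HTTP response headers on stdin and print the
# value of x-request-id. Header names are matched case-insensitively and the
# trailing carriage return from the HTTP line ending is stripped.
extract_request_id() {
  awk -F': *' 'tolower($1) == "x-request-id" {
    sub(/\r$/, "", $2)
    print $2
    exit
  }'
}

# Usage against a running server (not executed here):
#   id="$(curl -s -o /dev/null -D - "http://localhost:${PORT:-8787}/health" | extract_request_id)"
#   grep -F "$id" schemeta.log   # schemeta.log is a hypothetical log location
```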

## Production Checks

1. Verify process liveness:
   - `curl -s http://localhost:${PORT:-8787}/health`
2. Verify compile endpoint:
   - `POST` `frontend/sample.schemeta.json` to `/compile`.
3. Verify analyze endpoint:
   - `POST` the same sample to `/analyze`.
4. Verify rate limiting:
   - Exceed `MAX_REQUESTS_PER_MINUTE` with repeated `POST` requests inside one minute and confirm a `429` response.
5. Verify auth (if enabled):
   - Send `POST /compile` without a token and confirm `401`.
   - Send the same request with a valid token and confirm `200`.

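Checks 2 through 5 can be scripted with curl. This is a sketch under stated assumptions rather than a supported tool: the function names are hypothetical, the server is assumed to run on `localhost`, requests are sent as `application/json`, and a successful `POST` is expected to return `200` as in check 5.

```shell
# Hypothetical helper: POST file $2 to route $1 and print only the HTTP status.
# An optional third argument is sent as a bearer token.
post_status() {
  if [ -n "${3:-}" ]; then
    curl -s -o /dev/null -w '%{http_code}' \
      -H 'Content-Type: application/json' -H "Authorization: Bearer $3" \
      --data-binary "@$2" "http://localhost:${PORT:-8787}$1"
  else
    curl -s -o /dev/null -w '%{http_code}' \
      -H 'Content-Type: application/json' \
      --data-binary "@$2" "http://localhost:${PORT:-8787}$1"
  fi
}

# Compare an expected status ($1) with an observed one ($2).
check_status() {
  if [ "$2" = "$1" ]; then
    echo "ok: got $2"
  else
    echo "FAIL: expected $1, got $2"
    return 1
  fi
}

# Check 4: send POSTs until the limiter answers 429, giving up after $1 tries.
probe_rate_limit() {
  attempts=0
  while [ "$attempts" -lt "$1" ]; do
    attempts=$((attempts + 1))
    if [ "$(post_status /analyze "$2" "${3:-}")" = "429" ]; then
      echo "ok: 429 after $attempts requests"
      return 0
    fi
  done
  echo "FAIL: no 429 within $1 requests"
  return 1
}

# Usage against a running server (not executed here):
#   sample=frontend/sample.schemeta.json
#   check_status 200 "$(post_status /compile "$sample" "${SCHEMETA_AUTH_TOKEN:-}")"
#   check_status 401 "$(post_status /compile "$sample")"   # only when auth is enabled
#   probe_rate_limit "$(( ${MAX_REQUESTS_PER_MINUTE:-120} + 10 ))" "$sample" "${SCHEMETA_AUTH_TOKEN:-}"
```
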
## Incident Playbook

### High error rate (5xx)

1. Check process logs for stack traces and malformed payload spikes.
2. Check request body sizes against `MAX_BODY_BYTES`; raise or lower the limit as appropriate.
3. Reproduce with `frontend/sample.schemeta.json` to determine whether the failure depends on the submitted model.
4. Roll back to the previous known-good tag if a regression is confirmed.
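
The body size comparison in step 2 can be done offline before replaying a payload. A small sketch with a hypothetical function name; whether a body of exactly `MAX_BODY_BYTES` is accepted is not stated in this runbook, so the sketch treats it as within the limit.

```shell
# Hypothetical helper: report whether payload file $1 fits under
# MAX_BODY_BYTES (falls back to the documented default of 2097152).
check_body_size() {
  limit="${MAX_BODY_BYTES:-2097152}"
  # wc pads its output on some platforms, so strip the spaces.
  size="$(wc -c < "$1" | tr -d ' ')"
  if [ "$size" -le "$limit" ]; then
    echo "ok: $size bytes (limit $limit)"
  else
    echo "too large: $size bytes (limit $limit)"
    return 1
  fi
}
```

Run it as `check_body_size repro.json`, where `repro.json` stands for whichever captured payload is under investigation.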

### Elevated 429 responses

1. Confirm traffic source and whether bursts are expected.
2. If trusted internal clients are throttled, tune `MAX_REQUESTS_PER_MINUTE`.
3. Consider placing a reverse proxy in front of the service to apply tiered rate limits to external users.
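
To confirm the traffic source in step 1, the audit log can be summarized by client. A rough sketch, assuming each audit entry is a single-line JSON object in which `status` is a number and `client` is a string (the exact serialization is not specified in this runbook). The function name is hypothetical, and a JSON-aware tool would be sturdier than the pattern matching used here.

```shell
# Hypothetical helper: count throttled (429) requests per client from audit
# log lines on stdin, most throttled client first.
# Assumes one JSON object per line with a numeric "status" and string "client".
count_429_by_client() {
  grep -E '"status"[[:space:]]*:[[:space:]]*429([^0-9]|$)' |
    sed -n 's/.*"client"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
    sort | uniq -c | sort -rn |
    awk '{ print $1, $2 }'   # normalize the padding that uniq -c adds
}
```

For example, `count_429_by_client < schemeta.log` prints one `count client` pair per line; `schemeta.log` is a hypothetical log location.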

### UI/compile mismatch reports

1. Capture the JSON from the user (`Copy Repro` in the workspace UI).
2. Re-run it through `/compile` and inspect `warnings`, `errors`, and `layout_metrics`.
3. Compare with the last release baseline for crossing/overlap regressions.

## Release / Rollback

1. Follow `docs/release-checklist.md`.
2. Tag releases only after the checklist is complete and tests pass.
3. Keep the previous stable tag ready for fast rollback.

## Observability Recommendations

- Structured request logs are emitted by the app; keep proxy logs for edge-level traces.
- Track latency percentiles for `/compile` and `/analyze`.
- Track per-endpoint status code rates and top warning/error IDs.
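
Until a metrics pipeline is in place, latency percentiles can be estimated directly from the audit log. A rough sketch using the nearest-rank method, assuming single-line JSON entries with a numeric `duration_ms` and a string `path` (the exact serialization is not specified in this runbook). The function name is hypothetical and the percentile argument must be an integer from 1 to 100.

```shell
# Hypothetical helper: print the nearest-rank percentile $2 (integer, 1-100)
# of duration_ms for requests to path $1, reading audit log lines on stdin.
# The path is used inside a regular expression, so plain routes work best.
latency_percentile() {
  grep -E "\"path\"[[:space:]]*:[[:space:]]*\"$1\"" |
    sed -n 's/.*"duration_ms"[[:space:]]*:[[:space:]]*\([0-9.][0-9.]*\).*/\1/p' |
    sort -n |
    awk -v p="$2" '
      { v[NR] = $1 }
      END {
        if (NR == 0) exit 1
        rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
        print v[rank]
      }'
}
```

For example, `latency_percentile /compile 95 < schemeta.log` prints the 95th percentile for `/compile`; `schemeta.log` is a hypothetical log location.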