schemeta/docs/operations-runbook.md
Rbanh 31a47346ea
Harden API contracts with request IDs and audit telemetry
2026-02-18 22:19:38 -05:00


# Schemeta Operations Runbook
This runbook covers baseline production operation of the Schemeta API and UI.
## Runtime
- Node.js 18+ recommended.
- Start command: `npm run start`
- Default bind: `0.0.0.0:8787`
## Environment Variables
- `PORT` (default `8787`)
- `MAX_BODY_BYTES` (default `2097152`, i.e. 2 MiB)
  - Hard limit for request body size on `POST` endpoints.
- `MAX_REQUESTS_PER_MINUTE` (default `120`)
  - Maximum number of `POST` requests per client IP in each one-minute window.
- `SCHEMETA_AUTH_TOKEN` (optional)
  - When set, all `POST` API routes require either:
    - `Authorization: Bearer <token>`
    - `x-api-key: <token>`
- `CORS_ORIGIN` (optional)
  - If set, CORS is enabled for this origin only.
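
As a rough illustration of the variables above, the following sketch exports the documented defaults before launch. The CORS origin and the token value are placeholders, not values from this project.

```shell
# Example environment for a Schemeta process.
# Numeric values mirror the documented defaults.
export PORT=8787
export MAX_BODY_BYTES=2097152        # 2 * 1024 * 1024 bytes
export MAX_REQUESTS_PER_MINUTE=120   # POST requests per client IP per minute

# Placeholder origin (assumption); replace with the real UI origin.
export CORS_ORIGIN="https://schemeta.example.com"

# Optional: uncomment to require a token on POST routes.
# export SCHEMETA_AUTH_TOKEN="replace-with-a-long-random-value"
```

With these exported in the same shell, `npm run start` should pick them up, assuming the server reads its configuration from the process environment as this section implies.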
## Endpoints
- `GET /health`
  - Liveness probe; returns process uptime and status.
- `GET /`
  - Serves workspace UI.
- `POST /compile`
  - Compile + render with ERC/diagnostics and layout metrics.
- `POST /analyze`
  - Topology and diagnostics summary.
- `GET /mcp/ui-bundle`
  - Metadata for MCP UI embedding.
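
The endpoints above can be exercised with `curl`. This is a minimal sketch: the helper names `schemeta_health` and `schemeta_post` are hypothetical, and the `Content-Type: application/json` header is an assumption about what the server expects.

```shell
# Base URL assumes the documented default port on localhost.
BASE_URL="${BASE_URL:-http://localhost:${PORT:-8787}}"

# GET /health: print the liveness payload.
schemeta_health() {
  curl -s "$BASE_URL/health"
}

# POST a JSON file to an endpoint such as /compile or /analyze.
# The bearer token is sent only when SCHEMETA_AUTH_TOKEN is set.
schemeta_post() {
  endpoint="$1"
  json_file="$2"
  if [ -n "${SCHEMETA_AUTH_TOKEN:-}" ]; then
    curl -s -X POST "$BASE_URL$endpoint" \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $SCHEMETA_AUTH_TOKEN" \
      --data-binary "@$json_file"
  else
    curl -s -X POST "$BASE_URL$endpoint" \
      -H 'Content-Type: application/json' \
      --data-binary "@$json_file"
  fi
}
```

For example, `schemeta_post /compile frontend/sample.schemeta.json` sends the sample file as the request body and prints the response.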
## Request Correlation and Audit Logs
- Every response includes `x-request-id`.
- API envelopes include `request_id` for correlation in clients and logs.
- Server emits one JSON audit log entry per request on response finish with:
  - `request_id`
  - `method`
  - `path`
  - `status`
  - `duration_ms`
  - `client`
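
To tie a client-side failure to its audit entry, the `x-request-id` header can be captured and searched for in the logs. The sketch below assumes one JSON object per log line; the function names and the log file location are hypothetical.

```shell
# Read raw HTTP response headers on stdin and print the x-request-id value.
# Header names are case-insensitive, so the name is lowercased before comparing.
request_id_from_headers() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "x-request-id" { print $2; exit }'
}

# Print audit log lines from file $2 that contain request ID $1 as a quoted string.
# A fixed-string match avoids depending on the exact JSON spacing.
audit_lines_for() {
  grep -F -e "\"$1\"" "$2"
}

# Usage (not executed here; the log path is a placeholder):
#   rid="$(curl -s -D - -o /dev/null http://localhost:8787/health | request_id_from_headers)"
#   audit_lines_for "$rid" /path/to/schemeta-audit.log
```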
## Production Checks
1. Verify process liveness:
   - `curl -s http://localhost:${PORT:-8787}/health`
2. Verify the compile endpoint:
   - Post `frontend/sample.schemeta.json` to `/compile`.
3. Verify the analyze endpoint:
   - Post the same sample to `/analyze`.
4. Verify rate limiting:
   - Exceed `MAX_REQUESTS_PER_MINUTE` with repeated `POST` requests and confirm `429`.
5. Verify auth (if enabled):
   - Request `POST /compile` without a token and confirm `401`.
   - Request with a valid token and confirm `200`.
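
The rate-limit and auth checks can be scripted along these lines. This is a sketch under assumptions: the function names are hypothetical, the `Content-Type` header is a guess at what the server expects, and the script is assumed to run from the repository root so the sample path resolves.

```shell
BASE_URL="${BASE_URL:-http://localhost:${PORT:-8787}}"
SAMPLE="${SAMPLE:-frontend/sample.schemeta.json}"

# Print only the HTTP status of a POST /compile; extra arguments go to curl.
compile_status() {
  curl -s -o /dev/null -w '%{http_code}' -X POST "$BASE_URL/compile" \
    -H 'Content-Type: application/json' --data-binary "@$SAMPLE" "$@"
}

# Send up to $1 requests and succeed as soon as one returns 429.
# Any further arguments (for example an auth header) are passed to curl.
expect_rate_limit() {
  attempts="$1"
  shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(compile_status "$@")" = "429" ]; then
      return 0
    fi
    i=$((i + 1))
  done
  return 1
}

# With auth enabled: no token should yield 401, a valid token 200.
expect_auth_enforced() {
  [ "$(compile_status)" = "401" ] &&
    [ "$(compile_status -H "Authorization: Bearer $SCHEMETA_AUTH_TOKEN")" = "200" ]
}
```

Run `expect_auth_enforced` first, then something like `expect_rate_limit 130` for the default limit of 120. Once the limiter trips, later requests in the same window may return `429` instead of `401` or `200`.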
## Incident Playbook
### High error rate (5xx)
1. Check process logs for stack traces and malformed payload spikes.
2. Validate request body sizes; raise or lower `MAX_BODY_BYTES` as appropriate.
3. Reproduce with `frontend/sample.schemeta.json` to isolate model-driven payload issues.
4. Roll back to the previous known-good tag if a regression is confirmed.
### Elevated 429 responses
1. Confirm traffic source and whether bursts are expected.
2. If trusted internal clients are throttled, tune `MAX_REQUESTS_PER_MINUTE`.
3. Consider fronting with reverse proxy rate limit tiers for external users.
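
To confirm the traffic source, throttled requests can be counted per client from the audit log. The sketch below assumes one JSON object per line, with a numeric `status` field and a string `client` field serialized with at most one space after the colon; `throttled_clients` is a hypothetical helper name.

```shell
# Count 429 responses per client in the audit log file $1.
# Output: one "count client" line per client, busiest first.
throttled_clients() {
  grep -E '"status": ?429([^0-9]|$)' "$1" |
    sed -n 's/.*"client": *"\([^"]*\)".*/\1/p' |
    sort | uniq -c | sort -rn
}
```

A single client dominating the output points to one noisy caller rather than a general traffic increase. If `jq` is available, parsing the JSON properly is more robust than these text patterns.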
### UI/compile mismatch reports
1. Capture JSON from user (`Copy Repro` in workspace).
2. Re-run through `/compile` and inspect `warnings`, `errors`, and `layout_metrics`.
3. Compare with last release baseline for crossing/overlap regressions.
## Release / Rollback
1. Follow `docs/release-checklist.md`.
2. Tag releases after checklist completion and test pass.
3. Keep previous stable tag ready for fast rollback.
## Observability Recommendations
- Structured request logs are emitted by the app; keep proxy logs for edge-level traces.
- Track latency percentiles for `/compile` and `/analyze`.
- Track per-endpoint status code rates and top warning/error IDs.
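
Where no metrics pipeline exists yet, latency percentiles can be approximated from the audit log's `duration_ms` field. This is a sketch using the nearest-rank method; it assumes one JSON object per log line with at most one space after each colon, and `latency_percentile` is a hypothetical helper name.

```shell
# Print the nearest-rank percentile of duration_ms for one path.
#   $1 = audit log file, $2 = integer percentile (e.g. 95), $3 = path (e.g. /compile)
latency_percentile() {
  grep -E "\"path\": ?\"$3\"" "$1" |
    sed -n 's/.*"duration_ms": *\([0-9.]*\).*/\1/p' |
    sort -n |
    awk -v p="$2" '
      { v[NR] = $1 }
      END {
        if (NR == 0) exit 1
        rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
        if (rank < 1) rank = 1
        print v[rank]
      }'
}
```

For example, `latency_percentile audit.log 95 /compile` prints the p95 compile latency in milliseconds for the entries in `audit.log` (a placeholder file name).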