Rbanh/schemeta

Rbanh 31a47346ea

Harden API contracts with request IDs and audit telemetry

2026-02-18 22:19:38 -05:00

Schemeta Operations Runbook

This runbook covers baseline production operation of the Schemeta API and UI.

Runtime

Node.js 18+ recommended.
Start command: npm run start
Default bind: 0.0.0.0:8787

Environment Variables

PORT (default 8787)
MAX_BODY_BYTES (default 2097152)
- Hard limit for request body size on POST endpoints.
MAX_REQUESTS_PER_MINUTE (default 120)
- Rate limit for POST endpoints, applied per client IP over a one-minute window.
SCHEMETA_AUTH_TOKEN (optional)
- When set, all POST API routes require either:
  - Authorization: Bearer <token>
  - x-api-key: <token>
CORS_ORIGIN (optional)
- If set, CORS is enabled for this origin only.
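
For operator scripts that need to agree with the server on these values, the documented defaults can be mirrored in a small helper. This is a sketch only: `resolve_settings` is a hypothetical name, it does not reproduce the server's own configuration code, and it assumes that an unset or empty variable falls back to the default.

```python
import os

# Defaults as documented in this runbook.
DEFAULTS = {
    "PORT": 8787,
    "MAX_BODY_BYTES": 2097152,  # 2 MiB
    "MAX_REQUESTS_PER_MINUTE": 120,
}


def resolve_settings(environ=None):
    """Return the effective settings for the documented variables.

    Hypothetical helper for scripts; not the server's config loader.
    """
    env = os.environ if environ is None else environ
    settings = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name, "")
        # Assumption: an unset or empty variable falls back to the default.
        settings[name] = int(raw) if raw.strip() else default
    # Optional variables: None means "not configured".
    settings["SCHEMETA_AUTH_TOKEN"] = env.get("SCHEMETA_AUTH_TOKEN") or None
    settings["CORS_ORIGIN"] = env.get("CORS_ORIGIN") or None
    return settings
```

Calling `resolve_settings()` with no argument reads the current process environment; passing a dict makes the helper easy to exercise in isolation.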

Endpoints

GET /health
- Liveness probe, returns process uptime and status.
GET /
- Serves workspace UI.
POST /compile
- Compile + render with ERC/diagnostics and layout metrics.
POST /analyze
- Topology and diagnostics summary.
GET /mcp/ui-bundle
- Metadata for MCP UI embedding.

Request Correlation and Audit Logs

Every response includes x-request-id.
API envelopes include request_id for correlation in clients and logs.
The server emits one JSON audit log entry per request when the response finishes, with these fields:
- request_id
- method
- path
- status
- duration_ms
- client
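
To trace a single request from a client report back to the server logs, the audit entries can be filtered by `request_id`. The sketch below assumes one JSON object per log line; the sample values are made up, and only the field names come from the list above. `find_audit_entries` is a hypothetical helper name.

```python
import json


def find_audit_entries(log_lines, request_id):
    """Return parsed audit entries whose request_id matches."""
    matches = []
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON output such as stack traces
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and entry.get("request_id") == request_id:
            matches.append(entry)
    return matches


# Illustrative log lines; the values are invented for this example.
sample = [
    '{"request_id": "req-1", "method": "POST", "path": "/compile", '
    '"status": 200, "duration_ms": 41, "client": "10.0.0.5"}',
    "some non-JSON line mixed into the log",
    '{"request_id": "req-2", "method": "POST", "path": "/analyze", '
    '"status": 429, "duration_ms": 1, "client": "10.0.0.9"}',
]
print(find_audit_entries(sample, "req-2"))
```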

Production Checks

Verify process liveness:
- curl -s http://localhost:${PORT:-8787}/health
Verify compile endpoint:
- POST frontend/sample.schemeta.json to /compile.
Verify analyze endpoint:
- POST the same sample to /analyze.
Verify rate limiting:
- Exceed MAX_REQUESTS_PER_MINUTE with repeated POST requests from one client and confirm a 429 response.
Verify auth (if enabled):
- Send POST /compile without a token and confirm 401.
- Send the same request with a valid token and confirm 200.
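
The liveness, compile, analyze, and auth checks above could be scripted roughly as follows. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes the POST endpoints accept the sample file as an `application/json` body and that the bearer-token form of auth is used. The rate-limit check is left out on purpose, since it floods the service.

```python
import json
import urllib.error
import urllib.request


def build_post(base_url, path, payload, token=None):
    """Build a JSON POST request, adding the bearer token when given."""
    headers = {"Content-Type": "application/json"}  # assumed content type
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=body, headers=headers, method="POST"
    )


def status_of(request, timeout=10):
    """Send the request and return the HTTP status, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def run_checks(base_url, sample_path, token=None):
    """Run the liveness, compile, analyze and auth checks; return statuses."""
    with open(sample_path, encoding="utf-8") as fh:
        sample = json.load(fh)
    health = urllib.request.Request(base_url.rstrip("/") + "/health")
    results = {
        "health": status_of(health),
        "compile": status_of(build_post(base_url, "/compile", sample, token)),
        "analyze": status_of(build_post(base_url, "/analyze", sample, token)),
    }
    if token:
        # With auth enabled, a request without a token should yield 401.
        results["compile_no_token"] = status_of(
            build_post(base_url, "/compile", sample)
        )
    return results


# Example call against a locally running instance:
# run_checks("http://localhost:8787", "frontend/sample.schemeta.json")
```

Returning status codes instead of raising keeps the 401 and 429 cases visible as ordinary results, which is what these checks are meant to confirm.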

Incident Playbook

High error rate (5xx)

Check process logs for stack traces and malformed payload spikes.
Check request body sizes against MAX_BODY_BYTES; raise or lower the limit as appropriate.
Reproduce with frontend/sample.schemeta.json to isolate model-driven payload issues.
Roll back to the previous known-good tag if a regression is confirmed.

Elevated 429 responses

Confirm traffic source and whether bursts are expected.
If trusted internal clients are throttled, tune MAX_REQUESTS_PER_MINUTE.
Consider fronting the service with a reverse proxy that applies tiered rate limits for external users.
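
To confirm the traffic source, 429 responses in the audit log can be grouped by the `client` field. The sketch below assumes one JSON audit entry per line with `status` logged as an integer; `throttled_clients` is a hypothetical helper name.

```python
import json
from collections import Counter


def throttled_clients(log_lines, top=5):
    """Count 429 responses per client and return the most affected ones."""
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON log output
        if isinstance(entry, dict) and entry.get("status") == 429:
            counts[entry.get("client", "unknown")] += 1
    return counts.most_common(top)
```

A short list dominated by one or two internal addresses points toward tuning the limit; a long tail of unknown clients points toward rate limiting at the proxy.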

UI/compile mismatch reports

Capture the JSON from the user (Copy Repro in the workspace).
Re-run it through /compile and inspect warnings, errors, and layout_metrics.
Compare with the last release baseline for crossing/overlap regressions.
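
The baseline comparison can be reduced to a diff of the two `layout_metrics` objects. The sketch below assumes `layout_metrics` is a flat object of numeric values where a higher number is worse. The metric names `crossings` and `overlaps` are hypothetical, as is the helper `metric_regressions`; substitute whatever keys the actual response contains.

```python
def metric_regressions(baseline, current):
    """Return {name: (old, new)} for numeric metrics that increased."""
    regressions = {}
    for name, new in current.items():
        old = baseline.get(name)
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (old, new)
        )
        if numeric and new > old:
            regressions[name] = (old, new)
    return regressions


# Hypothetical metric names and values, for illustration only.
baseline = {"crossings": 2, "overlaps": 0}
current = {"crossings": 5, "overlaps": 0}
print(metric_regressions(baseline, current))
```

Metrics that exist only in the current response are skipped rather than reported, since there is no baseline value to compare against.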

Release / Rollback

Follow docs/release-checklist.md.
Tag releases only after the checklist is complete and tests pass.
Keep previous stable tag ready for fast rollback.

Observability Recommendations

Structured request logs are emitted by the app; keep proxy logs for edge-level traces.
Track latency percentiles for /compile and /analyze.
Track per-endpoint status code rates and top warning/error IDs.
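
Where no metrics pipeline is in place yet, latency percentiles and status code counts can be derived directly from the audit log. This is a minimal sketch, assuming one JSON entry per line with a numeric `duration_ms`; it uses the nearest-rank percentile method, and `nearest_rank` and `summarize` are hypothetical helper names.

```python
import json
import math
from collections import Counter, defaultdict


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(log_lines, percentiles=(50, 95, 99)):
    """Return per-path latency percentiles and status code counts."""
    durations = defaultdict(list)
    statuses = defaultdict(Counter)
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON log output
        if not isinstance(entry, dict) or "path" not in entry:
            continue
        path = entry["path"]
        statuses[path][entry.get("status")] += 1
        if isinstance(entry.get("duration_ms"), (int, float)):
            durations[path].append(entry["duration_ms"])
    report = {}
    for path, counts in statuses.items():
        values = sorted(durations[path])
        latency = {p: nearest_rank(values, p) for p in percentiles} if values else {}
        report[path] = {"latency_ms": latency, "status_counts": dict(counts)}
    return report
```

Reading `report["/compile"]` and `report["/analyze"]` gives the two endpoints called out above; a dedicated metrics system is still the better long-term home for these numbers.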