schemeta/docs/operations-runbook.md
Rbanh 31a47346ea
Harden API contracts with request IDs and audit telemetry
2026-02-18 22:19:38 -05:00

Schemeta Operations Runbook

This runbook covers baseline production operation of the Schemeta API and UI.

Runtime

  • Node.js 18+ recommended.
  • Start command: npm run start
  • Default bind: 0.0.0.0:8787

Environment Variables

  • PORT (default 8787)
  • MAX_BODY_BYTES (default 2097152, i.e. 2 MiB)
    • Hard limit for request body size on POST endpoints.
  • MAX_REQUESTS_PER_MINUTE (default 120)
    • Maximum number of POST requests accepted per client IP per minute.
  • SCHEMETA_AUTH_TOKEN (optional)
    • When set, all POST API routes require either:
      • Authorization: Bearer <token>
      • x-api-key: <token>
  • CORS_ORIGIN (optional)
    • If set, CORS is enabled for this origin only.
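
For operator tooling (smoke tests, dashboards) it can help to resolve the same variables with the same defaults as listed above. The following is a minimal sketch: `load_settings` is a hypothetical helper name, and how the server itself parses these values is not covered by this runbook, so treat the integer parsing as an assumption.

```python
import os

# Defaults copied from the variable list above.
DEFAULTS = {
    "PORT": 8787,
    "MAX_BODY_BYTES": 2097152,
    "MAX_REQUESTS_PER_MINUTE": 120,
}
# Optional settings: unset (or empty) means the feature is off.
OPTIONAL = ("SCHEMETA_AUTH_TOKEN", "CORS_ORIGIN")


def load_settings(env=None):
    """Return the effective settings for an environment mapping."""
    env = os.environ if env is None else env
    settings = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        # Assumption: numeric variables are plain base-10 integers.
        settings[name] = int(raw) if raw else default
    for name in OPTIONAL:
        settings[name] = env.get(name) or None
    return settings
```

Calling `load_settings()` with no argument reads the current process environment; passing a dict makes the helper easy to exercise in tests.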

Endpoints

  • GET /health
    • Liveness probe; returns process uptime and status.
  • GET /
    • Serves workspace UI.
  • POST /compile
    • Compile + render with ERC/diagnostics and layout metrics.
  • POST /analyze
    • Topology and diagnostics summary.
  • GET /mcp/ui-bundle
    • Metadata for MCP UI embedding.

Request Correlation and Audit Logs

  • Every response includes x-request-id.
  • API envelopes include request_id for correlation in clients and logs.
  • The server emits one JSON audit log entry per request, written when the response finishes, with these fields:
    • request_id
    • method
    • path
    • status
    • duration_ms
    • client
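
To trace a client-reported request_id back to its audit entry, a small filter over the process log is usually enough. This is a sketch under assumptions: it expects one JSON object per line, uses only the six field names listed above, and the helper names `parse_audit_line` and `find_request` are hypothetical.

```python
import json

# Field names taken from the list above; value types are not assumed.
AUDIT_FIELDS = ("request_id", "method", "path", "status", "duration_ms", "client")


def parse_audit_line(line):
    """Return the audit entry as a dict, or None if the line is not one."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None  # plain-text output such as startup messages or stack traces
    if not isinstance(entry, dict):
        return None
    if any(field not in entry for field in AUDIT_FIELDS):
        return None  # JSON, but not an audit entry
    return entry


def find_request(lines, request_id):
    """Return the first audit entry whose request_id matches, else None."""
    for line in lines:
        entry = parse_audit_line(line)
        if entry is not None and entry["request_id"] == request_id:
            return entry
    return None
```

With a log file open as `handle`, `find_request(handle, "<id from x-request-id>")` returns the matching entry or `None`.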

Production Checks

  1. Verify process liveness:
    • curl -s http://localhost:${PORT:-8787}/health
  2. Verify compile endpoint:
    • POST frontend/sample.schemeta.json to /compile.
  3. Verify analyze endpoint:
    • POST the same sample to /analyze.
  4. Verify rate limiting:
    • Exceed MAX_REQUESTS_PER_MINUTE with repeated POST requests and confirm a 429 response.
  5. Verify auth (if enabled):
    • Send POST /compile without a token and confirm 401.
    • Repeat the request with a valid token and confirm 200.
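
Checks 2 through 5 can be scripted with the Python standard library. The sketch below only covers building and sending the request; `build_post` and `send` are hypothetical helper names, and the `Content-Type: application/json` header is an assumption, since this runbook does not state which content type the API expects.

```python
import json
import urllib.error
import urllib.request


def build_post(base_url, path, payload, token=None, use_api_key=False):
    """Build (but do not send) a JSON POST request for /compile or /analyze."""
    # Assumption: the API accepts a JSON body with this content type.
    headers = {"Content-Type": "application/json"}
    if token:
        if use_api_key:
            headers["x-api-key"] = token
        else:
            headers["Authorization"] = f"Bearer {token}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=body, headers=headers, method="POST"
    )


def send(request, timeout=10):
    """Send the request and return the HTTP status code, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Load the sample with `json.load(open("frontend/sample.schemeta.json"))` and pass it as `payload`. Per the checks above, `send` should return 401 without a token when auth is enabled, 200 with a valid token, and 429 once the per-minute limit is exceeded.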

Incident Playbook

High error rate (5xx)

  1. Check process logs for stack traces and malformed payload spikes.
  2. Validate request body sizes; lower/raise MAX_BODY_BYTES as appropriate.
  3. Reproduce with frontend/sample.schemeta.json to isolate model-driven payload issues.
  4. Roll back to the previous known-good tag if a regression is confirmed.

Elevated 429 responses

  1. Confirm traffic source and whether bursts are expected.
  2. If trusted internal clients are throttled, tune MAX_REQUESTS_PER_MINUTE.
  3. Consider fronting the service with a reverse proxy that applies tiered rate limits for external users.
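
For step 1, the audit log already records the client for each request, so the throttled sources can be ranked directly. A minimal sketch, assuming one JSON object per log line and that status is logged as an integer; `throttled_clients` is a hypothetical helper name.

```python
import json
from collections import Counter


def throttled_clients(log_lines, top=5):
    """Rank client values by the number of 429 responses they received."""
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON output such as stack traces
        # Assumption: status is an integer, not a string.
        if isinstance(entry, dict) and entry.get("status") == 429:
            counts[entry.get("client", "unknown")] += 1
    return counts.most_common(top)
```

A single dominant client suggests a misbehaving caller or a legitimate batch job; a flat distribution points to a limit that is too low for normal traffic.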

UI/compile mismatch reports

  1. Capture the JSON from the user (Copy Repro in the workspace).
  2. Re-run through /compile and inspect warnings, errors, and layout_metrics.
  3. Compare with the last release baseline for crossing/overlap regressions.
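
The baseline comparison in step 3 can be automated once both layout_metrics objects are saved. The sketch below is illustrative only: the metric names used in the usage note (`crossings`, `overlaps`) are hypothetical, since this runbook does not list the keys inside layout_metrics, and the helper assumes that a larger number is worse.

```python
def metric_regressions(baseline, current, tolerance=0):
    """Return metrics whose value grew by more than `tolerance` versus baseline."""
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if not isinstance(old, (int, float)) or not isinstance(new, (int, float)):
            continue  # ignore missing or non-numeric metrics
        # Assumption: higher values (more crossings, more overlaps) are worse.
        if new - old > tolerance:
            regressions[name] = (old, new)
    return regressions
```

For example, `metric_regressions({"crossings": 3}, {"crossings": 5})` reports `{"crossings": (3, 5)}`, while an unchanged or improved metric is omitted.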

Release / Rollback

  1. Follow docs/release-checklist.md.
  2. Tag releases after the checklist is complete and the tests pass.
  3. Keep the previous stable tag ready for fast rollback.

Observability Recommendations

  • Structured request logs are emitted by the app; keep proxy logs for edge-level traces.
  • Track latency percentiles for /compile and /analyze.
  • Track per-endpoint status code rates and top warning/error IDs.
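
Until a metrics pipeline is in place, latency percentiles can be derived from the audit log's duration_ms field. This is a rough sketch rather than a replacement for proper monitoring: it assumes one JSON object per log line, uses the nearest-rank percentile definition, and the helper names are hypothetical.

```python
import json
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_path(log_lines, paths=("/compile", "/analyze"), pcts=(50, 95, 99)):
    """Group duration_ms by path and report the requested percentiles."""
    durations = defaultdict(list)
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON output
        if isinstance(entry, dict) and entry.get("path") in paths:
            value = entry.get("duration_ms")
            if isinstance(value, (int, float)):
                durations[entry["path"]].append(value)
    return {
        path: {f"p{p}": percentile(vals, p) for p in pcts}
        for path, vals in durations.items()
    }
```

Paths with no matching entries are left out of the result, so an empty dictionary means no /compile or /analyze traffic was found in the supplied lines.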