Athena

Cluster Operations

Health, metrics, and runbook-level Athena operations guidance.

Health Endpoints

  • GET /ping for lightweight liveness checks
  • GET /health/cluster for per-client connectivity and latency data
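
The two endpoints above can be polled from a small script. The following is a minimal sketch, not Athena's own tooling: the `fetch` and `probe` helpers, the timeout, and the base URL are assumptions made for illustration. The response body of /health/cluster is not specified here, so the sketch only looks at the HTTP status.

```python
import urllib.error
import urllib.request

HEALTH_PATHS = ("/ping", "/health/cluster")  # paths taken from the list above


def fetch(url, timeout=2.0):
    """Return (status, body) for a GET request; status is None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code, err.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None, ""


def probe(base_url, fetcher=fetch):
    """Map each health path to True when it answers with a 2xx status."""
    base = base_url.rstrip("/")
    results = {}
    for path in HEALTH_PATHS:
        status, _body = fetcher(base + path)
        results[path] = status is not None and 200 <= status < 300
    return results
```

Calling `probe("http://localhost:4000")` (a hypothetical address) returns a dictionary keyed by path. The `fetcher` parameter is injectable so the logic can be exercised without a running instance.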

Observability Endpoints

  • GET /metrics for Prometheus scrape output
  • GET /router/registry for the live route registry
  • Client stats and drill-down routes under /admin/clients/*
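
For ad hoc inspection, the /metrics output can be read with a few lines of code. The sketch below parses the Prometheus text exposition format into a dictionary; `parse_metrics` is a hypothetical helper, and the names Athena gives its metrics are not listed here, so none are assumed.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped, and an
    optional trailing timestamp is ignored.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        if rest:
            samples[series] = float(rest[0])
    return samples
```

Feeding it the body of a GET /metrics response yields one entry per series, with labels kept as part of the key.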

Trace and Log Sinks

When tracing file sinks are enabled, Athena splits log output by severity:

  • non-ERROR events into success.log
  • ERROR events into error.log
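
The routing rule above can be stated as a small function. This is an illustration of the rule only, not Athena's implementation; it assumes the five severity levels of the Rust `tracing` crate, and `sink_for` is a hypothetical name.

```python
LEVELS = ("TRACE", "DEBUG", "INFO", "WARN", "ERROR")  # assumed severity levels


def sink_for(level):
    """Return the file an event of the given severity is written to."""
    name = level.upper()
    if name not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    # Only ERROR goes to error.log; everything else goes to success.log.
    return "error.log" if name == "ERROR" else "success.log"
```

One consequence worth noting: under this rule WARN events land in success.log, so warnings will not show up when only error.log is monitored.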

Gateway requests also emit structured athena_rs::gateway_trace events. Each event is designed for fast incident triage and carries request identity, operation and table, status and outcome, duration, backend and cache hints, trace IDs, and row counters on a single line.
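
During triage it can help to isolate these events from the rest of the log. The sketch below assumes only that the target name appears verbatim in each emitted line; the exact line layout and field names are not specified here, so no parsing is attempted. `gateway_trace_lines` is a hypothetical helper.

```python
TRACE_TARGET = "athena_rs::gateway_trace"  # target name from the text above


def gateway_trace_lines(lines):
    """Yield only the lines that mention the gateway trace target."""
    for line in lines:
        if TRACE_TARGET in line:
            yield line.rstrip("\n")
```

Passing an open file handle for success.log or error.log as `lines` streams the matching events without loading the whole file into memory.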

Run through the following checks as part of routine operations:

  1. Verify that the logging and auth clients are configured.
  2. Confirm that the route contracts are available (/openapi.yaml, /openapi-wss.yaml).
  3. Track queue depth and job statuses for deferred and backup workflows.
  4. Inspect the query optimization and vacuum health routes periodically.
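
Step 2 is easy to automate. The following sketch reports which contract paths fail to answer with HTTP 200; `http_status`, `missing_contracts`, and the timeout are assumptions for illustration. The other steps depend on routes that are not listed here, so they are left out.

```python
import urllib.error
import urllib.request

CONTRACT_PATHS = ("/openapi.yaml", "/openapi-wss.yaml")  # from step 2 above


def http_status(url, timeout=2.0):
    """Return the HTTP status of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def missing_contracts(base_url, status_of=http_status):
    """Return the contract paths that did not answer with HTTP 200."""
    base = base_url.rstrip("/")
    return [path for path in CONTRACT_PATHS if status_of(base + path) != 200]
```

An empty list means both contracts are being served. Anything else names the paths to investigate.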

Environment Baselines

  • Ensure ATHENA_CONFIG_PATH is set explicitly in production deployments.
  • Confirm Postgres tools are available (pg_dump, pg_restore).
  • Configure Redis and S3-compatible settings where required.
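
The first two baselines can be verified before startup with a preflight check. This is a minimal sketch under stated assumptions: `preflight` is a hypothetical helper, and the tools are assumed to be resolved through PATH. Redis and S3 settings are not checked because their setting names are not given here.

```python
import os
import shutil

REQUIRED_TOOLS = ("pg_dump", "pg_restore")  # Postgres tools named above


def preflight(env=None, which=shutil.which):
    """Return a list of problems; an empty list means the baseline holds."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("ATHENA_CONFIG_PATH"):
        problems.append("ATHENA_CONFIG_PATH is not set")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```

Running `preflight()` in a deployment entrypoint and aborting on a non-empty result surfaces a missing config path or missing tool before the service accepts traffic.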

Failure Strategy

  • Use the deferred queue to absorb transient gateway pressure.
  • Use the backup restore APIs for controlled recovery.
  • Keep the audit/logging client healthy to preserve governance visibility.