Cluster Operations
Health, metrics, and runbook-level Athena operations guidance.
Health Endpoints
- GET /ping for lightweight liveness checks
- GET /health/cluster for per-client connectivity and latency data
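A minimal probe for the two health endpoints could look like the sketch below. It assumes that a 2xx status means "healthy" and inspects only the status code, because the response bodies are not specified here. The names `probe`, `http_status`, and `HEALTH_PATHS` are hypothetical helpers, not part of Athena.

```python
import urllib.error
import urllib.request
from typing import Callable

HEALTH_PATHS = ("/ping", "/health/cluster")  # the two health endpoints listed in this section


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status for a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with a non-2xx status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def probe(base_url: str, fetch: Callable[[str], int] = http_status) -> dict[str, bool]:
    """Map each health path to True when it answers with a 2xx status (assumed meaning: healthy)."""
    base = base_url.rstrip("/")
    return {path: 200 <= fetch(base + path) < 300 for path in HEALTH_PATHS}
```

Passing `fetch` as a parameter keeps the check easy to exercise without a running cluster; in practice `probe("http://<athena-host>:<port>")` would use the default `http_status`.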
Observability Endpoints
- GET /metrics for Prometheus scrape output
- GET /router/registry for live route registry
- client stats and drilldown routes under /admin/clients/*
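For quick ad-hoc inspection of the /metrics output, the Prometheus text exposition format can be reduced to per-metric totals with a few lines of Python. This is a rough sketch: it ignores labels and timestamps, does not handle label values that contain `}`, and the metric names in the usage note are made up for illustration rather than taken from Athena.

```python
import re

# metric name, optional {labels}, then the sample value; a timestamp may follow and is ignored
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_samples(text: str) -> dict[str, float]:
    """Sum sample values per metric name, skipping comments, blank lines, and unparsable lines."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines carry no samples
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, _labels, value = match.groups()
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # value was not a number
    return totals
```

Summing across label sets is only meaningful for counters and some gauges. For example, two hypothetical `demo_requests_total` samples with values 3 and 2 collapse into a single total of 5.0.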
Trace and log sinks
When tracing file sinks are enabled, Athena splits log output by severity:
- non-ERROR events into success.log
- ERROR events into error.log
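The routing rule above can be illustrated with Python's standard logging module. This is an analogy only, not Athena's implementation: the logger name and line format are assumptions, and Python's CRITICAL level, which sits above ERROR, is sent to error.log as well.

```python
import logging
from pathlib import Path


class BelowError(logging.Filter):
    """Pass only records below ERROR severity."""

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno < logging.ERROR


def build_split_logger(log_dir: Path, name: str = "split_demo") -> logging.Logger:
    """Create a logger that writes non-ERROR records to success.log and ERROR or above to error.log."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep records out of the root logger's handlers

    success = logging.FileHandler(log_dir / "success.log")
    success.addFilter(BelowError())  # drop ERROR and above from this sink

    errors = logging.FileHandler(log_dir / "error.log")
    errors.setLevel(logging.ERROR)  # drop everything below ERROR from this sink

    for handler in (success, errors):
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

With this setup every record lands in exactly one file, so error.log can be tailed during an incident without filtering out routine traffic.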
Gateway requests also emit structured athena_rs::gateway_trace events. These
events are designed for fast incident triage and include request identity,
operation/table, status/outcome, duration, backend/cache hints, trace IDs, and
row counters in one line.
Recommended Operational Checks
- Verify logging and auth clients are configured.
- Confirm route contract availability (/openapi.yaml, /openapi-wss.yaml).
- Track queue depth and job statuses for deferred, backup, and clone workflows.
- Inspect query optimization and vacuum health routes periodically.
- Verify that API and daemon ownership is deployed as intended.
Environment Baselines
- Ensure ATHENA_CONFIG_PATH is explicit in production deployments.
- Confirm Postgres tools are available (pg_dump, pg_restore).
- Configure Redis and S3-compatible settings where required.
- Enable daemon.enabled only when daemon-owned workers are deployed and supervised outside athena_rs.
- Remember that clone jobs never execute in athena_rs; they require athena_daemon or athena_clone_worker.
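Part of this baseline can be checked mechanically before a deploy. The sketch below covers only the two items that can be verified from the host environment: that ATHENA_CONFIG_PATH is set and that pg_dump and pg_restore are on PATH. The function name `preflight` and the wording of the messages are hypothetical; Redis, S3, and daemon settings are left out because their configuration keys are not described here.

```python
import os
import shutil
from typing import Callable, Mapping, Optional

REQUIRED_TOOLS = ("pg_dump", "pg_restore")  # Postgres tools named in the baseline list


def preflight(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return human-readable problems; an empty list means the checked items hold."""
    problems = []
    if not env.get("ATHENA_CONFIG_PATH", "").strip():
        problems.append("ATHENA_CONFIG_PATH is not set explicitly")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:  # shutil.which returns None when the tool is not found
            problems.append(f"{tool} not found on PATH")
    return problems
```

Calling `preflight()` with no arguments inspects the current process environment; a deploy script could refuse to continue when the returned list is non-empty.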
API And Daemon Split
- athena_rs is always the main HTTP and CLI process.
- athena-daemon is the bundled worker runtime for clone execution plus migrated workers.
- athena_clone_worker is the only supported "split one worker family first" runtime.
- Backup, deferred, Typesense, and legacy dedicated binaries are for full worker-per-process layouts or focused debugging, not partial inline replacement.
- All runtime processes should share the same control-plane or logging database.
Failure Strategy
- Use the deferred queue for transient gateway pressure.
- Use the backup restore APIs for controlled recovery.
- Keep the audit/logging client healthy for governance visibility.
- Use clone job event streams and daemon ids to localize failed clone execution quickly.