Cluster Operations

Health Endpoints

GET /ping for lightweight liveness checks
GET /health/cluster for per-client connectivity and latency data
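
As a rough illustration, the two health endpoints can be polled with a few lines of Python. This is a minimal sketch, not an official client: the base URL is supplied by the caller, the helper names (endpoint_url, probe, is_alive) are hypothetical, and treating any 2xx status as "alive" is an assumption. The response body is returned as raw text because its schema is not described here.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


def endpoint_url(base_url: str, path: str) -> str:
    """Join a gateway base URL and an endpoint path without doubling slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def probe(base_url: str, path: str, timeout: float = 2.0):
    """GET the endpoint and return (status, body); status is None if unreachable."""
    url = endpoint_url(base_url, path)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with a non-2xx status.
        return err.code, err.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as err:
        # Connection refused, DNS failure, timeout, and similar.
        return None, str(err)


def is_alive(status) -> bool:
    """Assumption: any 2xx response counts as a passing liveness check."""
    return status is not None and 200 <= status < 300
```

A caller might run probe(base, "/ping") on a short interval for liveness and probe(base, "/health/cluster") less often, since the cluster view carries per-client latency data and is likely the heavier call.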

Observability Endpoints

GET /metrics for Prometheus scrape output
GET /router/registry for live route registry
Client stats and drilldown routes under /admin/clients/*
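
The output of GET /metrics can be checked offline with a small parser for the Prometheus text exposition format. The sketch below is deliberately simplified: it keeps only the series name and value, and it ignores comment lines and timestamps. The metric names in the usage note are made up for illustration and are not Athena's actual metric names.

```python
def parse_metrics(text: str) -> dict:
    """Map each series (name plus label set) to its float value."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            head, _, rest = line.rpartition("}")
            series = head + "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            # The first field is the value; an optional timestamp may follow.
            samples[series] = float(fields[0])
        except ValueError:
            continue
    return samples
```

For example, feeding it a line such as demo_requests_total{method="GET"} 12 (a hypothetical metric) yields a dictionary entry with the value 12.0, which makes it easy to assert on thresholds in a smoke test.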

Trace and Log Sinks

When tracing file sinks are enabled, Athena splits log output by severity:

non-ERROR events into success.log
ERROR events into error.log
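
The routing rule above can be illustrated with Python's standard logging module. This is only a sketch of the same split, not Athena's implementation (Athena is a Rust service); the function and logger names are hypothetical, and sending Python's CRITICAL level to error.log alongside ERROR is an assumption made for the demo.

```python
import logging
import os


class BelowError(logging.Filter):
    """Pass only records below ERROR severity."""

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno < logging.ERROR


def build_split_logger(log_dir: str, name: str = "severity_split_demo") -> logging.Logger:
    """Create a logger that writes non-ERROR events to success.log and ERROR events to error.log."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()

    ok_handler = logging.FileHandler(os.path.join(log_dir, "success.log"))
    ok_handler.addFilter(BelowError())

    err_handler = logging.FileHandler(os.path.join(log_dir, "error.log"))
    err_handler.setLevel(logging.ERROR)

    for handler in (ok_handler, err_handler):
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

The practical consequence for operators is that error.log stays small and can be tailed or alerted on directly, while success.log carries the bulk of the traffic.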

Gateway requests also emit structured athena_rs::gateway_trace events. These events are designed for fast incident triage: each one packs request identity, operation/table, status/outcome, duration, backend/cache hints, trace IDs, and row counters into a single line.

Recommended Operational Checks

Verify that the logging and auth clients are configured.
Confirm route contract availability (/openapi.yaml, /openapi-wss.yaml).
Track queue depth and job statuses for deferred, backup, and clone workflows.
Inspect query optimization and vacuum health routes periodically.
Verify that the API and daemon ownership split is deployed as intended.
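
The route contract check lends itself to automation. The following sketch assumes that "available" means an HTTP 200 response for each contract path; the helper names are hypothetical, and the status fetcher is injectable so the logic can be exercised without a running server.

```python
import urllib.error
import urllib.request

# Contract documents named in the checklist above.
CONTRACT_PATHS = ("/openapi.yaml", "/openapi-wss.yaml")


def http_status(url: str, timeout: float = 2.0):
    """Return the HTTP status for a GET request, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def missing_contracts(base_url: str, fetch_status=http_status) -> list:
    """List the contract paths that did not answer with HTTP 200."""
    base = base_url.rstrip("/")
    return [path for path in CONTRACT_PATHS if fetch_status(base + path) != 200]
```

An empty list means both contracts were served. Anything else names the paths to investigate.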

Environment Baselines

Ensure ATHENA_CONFIG_PATH is explicit in production deployments.
Confirm Postgres tools are available (pg_dump, pg_restore).
Configure Redis and S3-compatible settings where required.
Turn on daemon.enabled only when daemon-owned workers are deployed and supervised outside athena_rs.
Remember that clone jobs never execute in athena_rs; they require athena_daemon or athena_clone_worker.
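
Part of this baseline can be verified before startup. The sketch below checks only what the list states explicitly: that ATHENA_CONFIG_PATH is set and that pg_dump and pg_restore resolve on PATH. The function name and message wording are hypothetical. Redis and S3 settings are left out because their configuration keys are not given here.

```python
import os
import shutil

# Postgres tools named in the baseline above.
REQUIRED_TOOLS = ("pg_dump", "pg_restore")


def baseline_problems(env=None, which=shutil.which) -> list:
    """Return human-readable problems; an empty list means the baseline passes."""
    env = os.environ if env is None else env
    problems = []
    # An unset or empty variable means the config path is not explicit.
    if not env.get("ATHENA_CONFIG_PATH"):
        problems.append("ATHENA_CONFIG_PATH is not set")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```

Calling baseline_problems() with no arguments inspects the real environment. Passing a dictionary and a stub lookup function makes the check easy to unit test.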

API and Daemon Split

athena_rs is always the main HTTP and CLI process.
athena_daemon is the bundled worker runtime for clone execution plus migrated workers.
athena_clone_worker is the only supported "split one worker family first" runtime.
Backup, deferred, Typesense, and legacy dedicated binaries are for full worker-per-process layouts or focused debugging, not partial inline replacement.
All runtime processes should share the same control-plane or logging database.

Failure Strategy

Use the deferred queue to absorb transient gateway pressure.
Use the backup restore APIs for controlled recovery.
Keep the audit/logging client healthy for governance visibility.
Use clone job event streams and daemon IDs to localize failed clone execution quickly.