Gateway Pool Starvation And Auth Failures
How high gateway throughput against a single instance can exhaust the pool
Audience
- Operators debugging intermittent `pool timed out while waiting for an open connection` errors
- Contributors changing gateway logging, auth, or catalog queries
Related Doc
See `architecture_gateway_client_stats_batcher.md` for the batched write path that mitigates this failure mode.
Symptom Chain
Under heavy load, logs may show:
- `sqlx::pool::acquire` slow acquire warnings
- `failed to look up api key` / `failed to load effective api key enforcement` with pool timeout
- HTTP 503 with messages such as `API key validation unavailable` or `API key policy unavailable`
Those responses are emitted when the auth layer cannot obtain a pooled connection within the configured acquire timeout, not necessarily because the key is invalid.
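A minimal sketch of that distinction, with illustrative names (`AuthDbError`, `auth_failure_response` are not the actual Athena identifiers): an acquire timeout is an infrastructure failure and maps to 503, while a genuine lookup miss maps to an auth rejection.

```rust
// Illustrative sketch only: the real auth layer lives elsewhere in Athena.
#[derive(Debug, PartialEq)]
enum AuthDbError {
    PoolTimedOut, // acquire timeout hit: pool starved, key status unknown
    KeyNotFound,  // query ran and found no matching key
}

/// Map a lookup failure to an HTTP status code and message.
fn auth_failure_response(err: AuthDbError) -> (u16, &'static str) {
    match err {
        // Infrastructure failure: do NOT report the key as invalid.
        AuthDbError::PoolTimedOut => (503, "API key validation unavailable"),
        AuthDbError::KeyNotFound => (401, "invalid API key"),
    }
}

fn main() {
    let (status, msg) = auth_failure_response(AuthDbError::PoolTimedOut);
    println!("{status} {msg}");
}
```

The key design point is that a starved pool must surface as a retryable 5xx, never as a key rejection the caller might act on.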
Request Topology
Gateway work splits into a synchronous hot path (must complete quickly) and spawned background work (logging and telemetry). When both compete for the same `PgPool`, the pool becomes the shared bottleneck.
When `gateway.auth_client` and `gateway.logging_client` resolve to the same database connection target, pool sharing and contention on catalog tables amplify each other: many tasks hold a connection while waiting on row locks from concurrent `INSERT ... ON CONFLICT DO UPDATE` statements targeting the same counters.
Contention Mechanism
For one busy logical client name, every request performs counter upserts that map to one row in `client_statistics` and often one row per `(client_name, table_name, operation)` in `client_table_statistics`. PostgreSQL serializes conflicting updates on the same row.
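A back-of-envelope model makes the serialization concrete (the numbers below are illustrative assumptions, not measurements): the hot row's lock is a serial resource, and utilization above 1.0 means the lock queue grows without bound, with each waiter holding a pooled connection the whole time.

```rust
/// Fraction of wall-clock time one row's lock is held, given a request rate
/// that targets that row and an average lock-hold time per upsert.
/// Values >= 1.0 mean demand exceeds what a single row can serialize.
fn row_lock_utilization(requests_per_sec: f64, lock_hold_ms: f64) -> f64 {
    requests_per_sec * lock_hold_ms / 1000.0
}

fn main() {
    // Hypothetical: 600 req/s against one hot client row, ~2 ms per upsert.
    let u = row_lock_utilization(600.0, 2.0);
    println!("utilization = {u:.2}"); // prints "utilization = 1.20"
}
```

At utilization 1.2, upserts arrive 20% faster than the row can absorb them, so waiters (and the connections they hold) accumulate until the pool is exhausted.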
Pool Configuration Context
Pool size and acquire timeout are configured via runtime environment (see `RuntimeEnvSettings` in `src/config_validation.rs` and `connection_pool_manager_from_env` in `src/bootstrap/postgres_init.rs`). Raising limits alone does not remove row-level serialization; it only postpones starvation under higher load.
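A sketch of the env-driven shape of that configuration, assuming hypothetical variable names (`ATHENA_PG_POOL_MAX`, `ATHENA_PG_ACQUIRE_TIMEOUT_SECS`) and defaults; the real values come from `RuntimeEnvSettings`:

```rust
use std::time::Duration;

/// Parse an optional raw string, falling back to a default on absence or error.
fn parse_or<T: std::str::FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|s| s.parse().ok()).unwrap_or(default)
}

/// Hypothetical env names and defaults, for illustration only.
fn pool_settings_from_env() -> (u32, Duration) {
    let pool_max = parse_or(std::env::var("ATHENA_PG_POOL_MAX").ok(), 16u32);
    let timeout_secs = parse_or(std::env::var("ATHENA_PG_ACQUIRE_TIMEOUT_SECS").ok(), 5u64);
    (pool_max, Duration::from_secs(timeout_secs))
}

fn main() {
    let (pool_max, acquire_timeout) = pool_settings_from_env();
    println!("pool_max={pool_max}, acquire_timeout={acquire_timeout:?}");
}
```

Whatever the names, the point from the doc stands: these are knobs on the symptom, not on the row-level serialization that causes it.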
Mitigation Summary
- Primary: coalesce counter and `last_seen` writes through a bounded in-process batcher so each flush window performs far fewer statements per hot key (see the companion architecture doc).
- Guard best-effort writers: use the shared `AppState.logging_task_limiter` for spawned logging/audit tasks (`gateway_*_log`, auth attempt logs, API-key last-used touch, admission-event persistence) so telemetry cannot consume all pool slots under burst load.
- Operational: ensure Neon or self-hosted Postgres `max_connections` is consistent with the sum of per-client pool ceilings configured for Athena.
- Not recommended as the fix: `fail_open` auth mode weakens enforcement; prefer relieving pool and lock pressure first.
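The coalescing idea in the primary mitigation can be sketched with a toy in-memory accumulator (names like `CounterBatcher` are illustrative; the real batcher is described in the companion doc): many requests for the same hot key collapse into one pending upsert per flush window.

```rust
use std::collections::HashMap;

/// Toy write coalescer: accumulates per-key counts in memory and emits one
/// entry per distinct key at flush time, instead of one statement per request.
#[derive(Default)]
struct CounterBatcher {
    pending: HashMap<(String, String, String), u64>,
}

impl CounterBatcher {
    fn record(&mut self, client: &str, table: &str, op: &str) {
        *self
            .pending
            .entry((client.into(), table.into(), op.into()))
            .or_insert(0) += 1;
    }

    /// Drain the window: each returned entry would become one upsert.
    fn flush(&mut self) -> Vec<((String, String, String), u64)> {
        self.pending.drain().collect()
    }
}

fn main() {
    let mut batcher = CounterBatcher::default();
    for _ in 0..1000 {
        batcher.record("svc-a", "orders", "SELECT");
    }
    batcher.record("svc-a", "orders", "INSERT");
    // 1001 requests collapse into 2 pending upserts for this window.
    println!("{} statements", batcher.flush().len());
}
```

The contention win is direct: the hot row sees one conflicting update per flush window rather than one per request, so the row-lock utilization drops by the coalescing factor.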
Logging Task Limiter (post-batcher hardening)
Even with batched `client_statistics`, Athena still performs several append-style writes asynchronously. To prevent unbounded spawn fanout from pinning the shared logging pool at max occupancy, spawned telemetry now routes through `spawn_best_effort_logging_task` (`src/utils/logging_task_limiter.rs`).
- Limiter source: `AppState.logging_task_limiter` (created in `src/bootstrap/mod.rs`).
- Sizing policy: derived from pool max as `max(1, pool_max / 4)`, capped at `16`.
- Behavior at saturation: drop the best-effort telemetry task and emit a warning, rather than queueing and extending connection pressure.
This is intentionally a degradation policy: core request handling and auth should retain pool headroom, while non-critical telemetry may be sampled/dropped during spikes.
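The sizing policy and drop-at-saturation behavior above can be sketched as follows (a minimal std-only sketch; the real limiter in `src/utils/logging_task_limiter.rs` may differ in structure):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sizing policy from the doc: max(1, pool_max / 4), capped at 16.
fn limiter_capacity(pool_max: usize) -> usize {
    (pool_max / 4).max(1).min(16)
}

/// Try-acquire limiter: never queues, so a burst of telemetry cannot
/// pile up waiters that each go on to hold a pool connection.
struct LoggingTaskLimiter {
    in_flight: AtomicUsize,
    cap: usize,
}

impl LoggingTaskLimiter {
    fn new(pool_max: usize) -> Self {
        Self { in_flight: AtomicUsize::new(0), cap: limiter_capacity(pool_max) }
    }

    /// Returns true if the task may run; false means "drop and warn".
    fn try_acquire(&self) -> bool {
        let mut current = self.in_flight.load(Ordering::Relaxed);
        loop {
            if current >= self.cap {
                return false; // saturated: caller drops the telemetry task
            }
            match self.in_flight.compare_exchange(
                current, current + 1, Ordering::AcqRel, Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual, // lost the race; retry
            }
        }
    }

    fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let limiter = LoggingTaskLimiter::new(4); // cap = max(1, 4/4) = 1
    println!("first acquire: {}", limiter.try_acquire());
    println!("second acquire (saturated, dropped): {}", limiter.try_acquire());
}
```

Because `try_acquire` fails fast instead of blocking, saturation sheds load immediately, which is exactly the degradation policy the section describes.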