Gateway Pool Starvation And Auth Failures
How high gateway throughput against a single instance can exhaust the pool
Audience
- Operators debugging intermittent `pool timed out while waiting for an open connection` errors
- Contributors changing gateway logging, auth, or catalog queries
Related Doc
See `architecture_gateway_client_stats_batcher.md` for the batched write path that mitigates this failure mode.
Symptom Chain
Under heavy load, logs may show:
- `sqlx::pool::acquire` slow acquire warnings
- `failed to look up api key` / `failed to load effective api key enforcement` with pool timeout
- HTTP 503 with messages such as `API key validation unavailable` or `API key policy unavailable`
Those responses are emitted when the auth layer cannot obtain a pooled connection within the configured acquire timeout, not necessarily because the key is invalid.
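A minimal sketch of that distinction, with illustrative names (`AuthDbError`, `auth_failure_response` are not the actual Athena identifiers): an acquire timeout is an infrastructure failure and maps to 503, while a genuine lookup miss maps to an auth rejection.

```rust
// Illustrative sketch only: the real auth layer lives elsewhere in Athena.
#[derive(Debug, PartialEq)]
enum AuthDbError {
    PoolTimedOut, // acquire timeout hit: pool starved, key status unknown
    KeyNotFound,  // query ran and found no matching key
}

/// Map a lookup failure to an HTTP status code and message.
fn auth_failure_response(err: AuthDbError) -> (u16, &'static str) {
    match err {
        // Infrastructure failure: do NOT report the key as invalid.
        AuthDbError::PoolTimedOut => (503, "API key validation unavailable"),
        AuthDbError::KeyNotFound => (401, "invalid API key"),
    }
}

fn main() {
    let (status, msg) = auth_failure_response(AuthDbError::PoolTimedOut);
    println!("{status} {msg}");
}
```

The key design point is that a starved pool must surface as a retryable 5xx, never as a key rejection the caller might act on.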
Request Topology
Gateway work splits into a synchronous hot path (must complete quickly) and spawned background work (logging and telemetry). When both compete for the same `PgPool`, the pool becomes the shared bottleneck.
When `gateway.auth_client` and `gateway.logging_client` resolve to the same database connection target, pool sharing and contention on catalog tables amplify each other: many tasks hold a connection while waiting on row locks from concurrent `INSERT ... ON CONFLICT DO UPDATE` statements targeting the same counters.
Contention Mechanism
For one busy logical client name, every request performs counter upserts that map to one row in `client_statistics` and often one row per `(client_name, table_name, operation)` in `client_table_statistics`. PostgreSQL serializes conflicting updates on the same row.
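A back-of-envelope model makes the serialization concrete (the numbers below are illustrative assumptions, not measurements): the hot row's lock is a serial resource, and utilization above 1.0 means the lock queue grows without bound, with each waiter holding a pooled connection the whole time.

```rust
/// Fraction of wall-clock time one row's lock is held, given a request rate
/// that targets that row and an average lock-hold time per upsert.
/// Values >= 1.0 mean demand exceeds what a single row can serialize.
fn row_lock_utilization(requests_per_sec: f64, lock_hold_ms: f64) -> f64 {
    requests_per_sec * lock_hold_ms / 1000.0
}

fn main() {
    // Hypothetical: 600 req/s against one hot client row, ~2 ms per upsert.
    let u = row_lock_utilization(600.0, 2.0);
    println!("utilization = {u:.2}"); // prints "utilization = 1.20"
}
```

At utilization 1.2, upserts arrive 20% faster than the row can absorb them, so waiters (and the connections they hold) accumulate until the pool is exhausted.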
Pool Configuration Context
Pool size and acquire timeout are configured via runtime environment (see `RuntimeEnvSettings` in `src/config_validation.rs` and `connection_pool_manager_from_env` in `src/bootstrap/postgres_init.rs`). Raising limits alone does not remove row-level serialization; it only postpones starvation under higher load.
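A sketch of the env-driven shape of that configuration, assuming hypothetical variable names (`ATHENA_PG_POOL_MAX`, `ATHENA_PG_ACQUIRE_TIMEOUT_SECS`) and defaults; the real values come from `RuntimeEnvSettings`:

```rust
use std::time::Duration;

/// Parse an optional raw string, falling back to a default on absence or error.
fn parse_or<T: std::str::FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|s| s.parse().ok()).unwrap_or(default)
}

/// Hypothetical env names and defaults, for illustration only.
fn pool_settings_from_env() -> (u32, Duration) {
    let pool_max = parse_or(std::env::var("ATHENA_PG_POOL_MAX").ok(), 16u32);
    let timeout_secs = parse_or(std::env::var("ATHENA_PG_ACQUIRE_TIMEOUT_SECS").ok(), 5u64);
    (pool_max, Duration::from_secs(timeout_secs))
}

fn main() {
    let (pool_max, acquire_timeout) = pool_settings_from_env();
    println!("pool_max={pool_max}, acquire_timeout={acquire_timeout:?}");
}
```

Whatever the names, the point from the doc stands: these are knobs on the symptom, not on the row-level serialization that causes it.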
Mitigation Summary
- Primary: coalesce counter and `last_seen` writes through a bounded in-process batcher so each flush window performs far fewer statements per hot key (see the companion architecture doc).
- Guard best-effort writers: use the shared `AppState.logging_task_limiter` for spawned logging/audit tasks (`gateway_*_log`, auth attempt logs, API-key last-used touch, admission-event persistence) so telemetry cannot consume all pool slots under burst load.
- Operational: ensure Neon or self-hosted Postgres `max_connections` is consistent with the sum of per-client pool ceilings configured for Athena.
- Not recommended as the fix: `fail_open` auth mode weakens enforcement; prefer relieving pool and lock pressure first.
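The coalescing idea in the primary mitigation can be sketched with a toy in-memory accumulator (names like `CounterBatcher` are illustrative; the real batcher is described in the companion doc): many requests for the same hot key collapse into one pending upsert per flush window.

```rust
use std::collections::HashMap;

/// Toy write coalescer: accumulates per-key counts in memory and emits one
/// entry per distinct key at flush time, instead of one statement per request.
#[derive(Default)]
struct CounterBatcher {
    pending: HashMap<(String, String, String), u64>,
}

impl CounterBatcher {
    fn record(&mut self, client: &str, table: &str, op: &str) {
        *self
            .pending
            .entry((client.into(), table.into(), op.into()))
            .or_insert(0) += 1;
    }

    /// Drain the window: each returned entry would become one upsert.
    fn flush(&mut self) -> Vec<((String, String, String), u64)> {
        self.pending.drain().collect()
    }
}

fn main() {
    let mut batcher = CounterBatcher::default();
    for _ in 0..1000 {
        batcher.record("svc-a", "orders", "SELECT");
    }
    batcher.record("svc-a", "orders", "INSERT");
    // 1001 requests collapse into 2 pending upserts for this window.
    println!("{} statements", batcher.flush().len());
}
```

The contention win is direct: the hot row sees one conflicting update per flush window rather than one per request, so the row-lock utilization drops by the coalescing factor.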
Logging Task Limiter (post-batcher hardening)
Even with batched `client_statistics`, Athena still performs several append-style writes asynchronously. To prevent unbounded spawn fanout from pinning the shared logging pool at max occupancy, spawned telemetry now routes through `spawn_best_effort_logging_task` (`src/utils/logging_task_limiter.rs`).
- Limiter source: `AppState.logging_task_limiter` (created in `src/bootstrap/mod.rs`).
- Sizing policy: derived from pool max as `max(1, pool_max / 4)`, capped at `16`.
- Behavior at saturation: drop the best-effort telemetry task and emit a warning, rather than queueing and extending connection pressure.
This is intentionally a degradation policy: core request handling and auth should retain pool headroom, while non-critical telemetry may be sampled/dropped during spikes.
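The sizing policy and drop-at-saturation behavior above can be sketched as follows (a minimal std-only sketch; the real limiter in `src/utils/logging_task_limiter.rs` may differ in structure):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sizing policy from the doc: max(1, pool_max / 4), capped at 16.
fn limiter_capacity(pool_max: usize) -> usize {
    (pool_max / 4).max(1).min(16)
}

/// Try-acquire limiter: never queues, so a burst of telemetry cannot
/// pile up waiters that each go on to hold a pool connection.
struct LoggingTaskLimiter {
    in_flight: AtomicUsize,
    cap: usize,
}

impl LoggingTaskLimiter {
    fn new(pool_max: usize) -> Self {
        Self { in_flight: AtomicUsize::new(0), cap: limiter_capacity(pool_max) }
    }

    /// Returns true if the task may run; false means "drop and warn".
    fn try_acquire(&self) -> bool {
        let mut current = self.in_flight.load(Ordering::Relaxed);
        loop {
            if current >= self.cap {
                return false; // saturated: caller drops the telemetry task
            }
            match self.in_flight.compare_exchange(
                current, current + 1, Ordering::AcqRel, Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual, // lost the race; retry
            }
        }
    }

    fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let limiter = LoggingTaskLimiter::new(4); // cap = max(1, 4/4) = 1
    println!("first acquire: {}", limiter.try_acquire());
    println!("second acquire (saturated, dropped): {}", limiter.try_acquire());
}
```

Because `try_acquire` fails fast instead of blocking, saturation sheds load immediately, which is exactly the degradation policy the section describes.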