FerryAPI

Quota and developer experience

AI API Rate Limit Headers for OpenAI-Compatible Gateways

A practical guide to rate limit headers for SaaS teams running OpenAI-compatible AI gateways: expose quotas, retry timing, model limits, prepaid balance risk, and support-friendly request IDs.

Why rate limit headers matter for AI products

AI API limits are harder to explain than ordinary REST limits. A single request can use different models, stream for minutes, trigger retries, call tools, reserve prepaid balance, and settle at a final cost that differs from the estimate. If the gateway only returns a generic 429, developers cannot tell whether they hit a per-minute abuse limit, a monthly plan cap, a model-tier rule, or a low-balance block.

Good headers turn quota enforcement into an integration contract. They help SDKs back off safely, help customers understand plan usage, and give support teams a shared request ID when billing or reliability questions appear.

Recommended header set

Header | Purpose | Example
X-RateLimit-Limit | Maximum requests allowed in the active short window. | 60
X-RateLimit-Remaining | Requests remaining before the short-window throttle blocks new calls. | 17
X-RateLimit-Reset | Unix timestamp when the short request window resets. | 1780572210
Retry-After | Seconds the client should wait after a 429 or temporary gateway throttle. | 12
X-AI-Quota-Policy | Human-readable policy name used for the decision. | starter-monthly-usd-cap
X-AI-Quota-Remaining-Usd | Estimated remaining plan or prepaid allowance after the current reservation. | 42.80
X-AI-Model-Tier | The approved model tier or route class after policy evaluation. | standard
X-FerryAPI-Request-Id | Canonical request ID for support, logs, and usage ledger correlation. | req_8f3c...
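
On the client side, these headers are easier to work with once they are parsed into one typed object. The following is a minimal Python sketch, not an official SDK: the `RateLimitInfo` class and `parse_rate_limit_headers` function are hypothetical names, and the code assumes `Retry-After` is sent as whole seconds, as the table describes.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Mapping, Optional


@dataclass
class RateLimitInfo:
    """Hypothetical container for the header set listed above."""
    limit: Optional[int]
    remaining: Optional[int]
    reset_epoch: Optional[int]
    retry_after_s: Optional[int]
    policy: Optional[str]
    quota_remaining_usd: Optional[Decimal]
    model_tier: Optional[str]
    request_id: Optional[str]


def _to_int(value: Optional[str]) -> Optional[int]:
    # Missing or malformed values become None instead of raising.
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


def _to_decimal(value: Optional[str]) -> Optional[Decimal]:
    # Decimal avoids float rounding surprises when displaying money.
    try:
        return Decimal(value) if value is not None else None
    except InvalidOperation:
        return None


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitInfo:
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v.strip() for k, v in headers.items()}
    return RateLimitInfo(
        limit=_to_int(h.get("x-ratelimit-limit")),
        remaining=_to_int(h.get("x-ratelimit-remaining")),
        reset_epoch=_to_int(h.get("x-ratelimit-reset")),
        retry_after_s=_to_int(h.get("retry-after")),
        policy=h.get("x-ai-quota-policy"),
        quota_remaining_usd=_to_decimal(h.get("x-ai-quota-remaining-usd")),
        model_tier=h.get("x-ai-model-tier"),
        request_id=h.get("x-ferryapi-request-id"),
    )
```

Treating every field as optional keeps the parser usable when a response carries only part of the header set, for example a throttle response that omits the remaining-allowance header.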

Separate request throttles from spend controls

Do not overload one header to describe every limit. Request-per-minute throttles protect infrastructure and abuse boundaries. Spend and quota controls protect customer budgets. Model-tier policies protect premium provider routes. They should be evaluated together, but exposed clearly enough that a client can choose the right response.

For example, a browser-issued customer key may still have monthly budget available but hit a per-key burst limit. In that case the SDK should wait and retry. If the tenant has no prepaid balance left, the SDK should not retry immediately; the product should ask an admin to top up or downgrade the workload.
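
The two cases above can be expressed as a small decision function in an SDK. This is an illustrative Python sketch: the `rate_limit_exceeded` error type appears in the example response in this guide, while `insufficient_balance`, `quota_exceeded`, and the `Action` enum are hypothetical names chosen for illustration.

```python
from enum import Enum


class Action(Enum):
    WAIT_AND_RETRY = "wait_and_retry"   # short-window throttle: back off, then retry
    ASK_ADMIN = "ask_admin"             # budget problem: retrying will not help
    FAIL = "fail"                       # anything else: surface the error


# Error types treated as temporary throttles.
THROTTLE_TYPES = {"rate_limit_exceeded"}
# Hypothetical error types for spend controls; real names depend on the gateway.
BUDGET_TYPES = {"insufficient_balance", "quota_exceeded"}


def choose_action(status: int, error_type: str) -> Action:
    """Map a gateway error to the response the client should take."""
    if status == 429 and error_type in THROTTLE_TYPES:
        return Action.WAIT_AND_RETRY
    if error_type in BUDGET_TYPES:
        return Action.ASK_ADMIN
    return Action.FAIL
```

Keeping the mapping in one place makes it harder for a retry loop to treat a budget block as a transient error.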

Example 429 response

HTTP/1.1 429 Too Many Requests
Retry-After: 12
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1780572210
X-AI-Quota-Policy: browser-key-burst-limit
X-AI-Model-Tier: standard
X-FerryAPI-Request-Id: req_8f3c2a

{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "This customer API key has reached its short-window request limit. Retry after 12 seconds.",
    "request_id": "req_8f3c2a"
  }
}
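
A client receiving a response like this needs to turn the headers into a wait time. The sketch below is one possible approach, under stated assumptions: `wait_seconds` is a hypothetical helper, it prefers `Retry-After` (which HTTP allows as either seconds or an HTTP-date), falls back to `X-RateLimit-Reset`, and uses an arbitrary 1-second default when neither header is usable.

```python
import time
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def wait_seconds(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before retrying a throttled request."""
    now = time.time() if now is None else now
    h = {k.lower(): v.strip() for k, v in headers.items()}

    retry_after = h.get("retry-after")
    if retry_after:
        if retry_after.isdigit():
            # Delay expressed in whole seconds, e.g. "12".
            return float(retry_after)
        try:
            # Delay expressed as an HTTP-date.
            target = parsedate_to_datetime(retry_after).timestamp()
            return max(0.0, target - now)
        except (TypeError, ValueError):
            pass  # Unparseable value: fall through to the reset header.

    reset = h.get("x-ratelimit-reset")
    if reset and reset.isdigit():
        # Unix timestamp when the short window resets.
        return max(0.0, float(reset) - now)

    return 1.0  # Assumed conservative default, not specified by the gateway.
```

For the response above, `wait_seconds` returns `12.0` because `Retry-After` takes priority over the reset timestamp.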

How this connects to quotas, retries, and idempotency

Rate limit headers should line up with your AI API quota policy, not contradict it. If the gateway blocks a premium model route because the tenant has no remaining allowance, say that through the policy name and a clear error type.
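
On the gateway side, a blocked premium route might be reported as in the following Python sketch. It is illustrative only: `build_quota_block_response` is a hypothetical helper, the `quota_exceeded` error type is an assumed name, and returning 429 without `Retry-After` is a design assumption that signals waiting alone will not clear the block.

```python
import json
from typing import Dict, Tuple


def build_quota_block_response(
    policy: str, model_tier: str, request_id: str, remaining_usd: float
) -> Tuple[int, Dict[str, str], str]:
    """Build (status, headers, body) for a request blocked by a spend control."""
    headers = {
        "Content-Type": "application/json",
        "X-AI-Quota-Policy": policy,
        "X-AI-Quota-Remaining-Usd": f"{remaining_usd:.2f}",
        "X-AI-Model-Tier": model_tier,
        "X-FerryAPI-Request-Id": request_id,
        # No Retry-After: this block is not cleared by waiting a few seconds.
    }
    body = {
        "error": {
            "type": "quota_exceeded",  # assumed name, distinct from rate_limit_exceeded
            "message": f"Policy '{policy}' has no remaining allowance for this route.",
            "request_id": request_id,
        }
    }
    return 429, headers, json.dumps(body)
```

Because the policy name appears in both the header and the message, a developer reading either one can tell this was a budget decision and not a burst throttle.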

They also need to work with retry controls. A gateway retry budget policy should respect Retry-After and avoid turning throttled calls into provider-spend storms. For client-side retries, pair headers with idempotency keys so a timeout or reconnect does not create duplicate billable work.
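
A client-side retry loop that follows these rules could look like the Python sketch below. The assumptions: `call_with_retries` and the injected `send` callable are hypothetical, `Idempotency-Key` is a commonly used header name that a given gateway may or may not honor, and the attempt and wait limits are arbitrary defaults.

```python
import time
import uuid
from typing import Any, Callable, Dict, Tuple

# send(payload, extra_headers) -> (status, response_headers, body)
Send = Callable[[Any, Dict[str, str]], Tuple[int, Dict[str, str], Any]]


def call_with_retries(
    send: Send,
    payload: Any,
    max_attempts: int = 3,
    max_total_wait_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Retry throttled calls, reusing one idempotency key for every attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    # One key per logical request, so a retry cannot create duplicate billable work.
    idempotency_key = str(uuid.uuid4())
    waited = 0.0

    for attempt in range(1, max_attempts + 1):
        status, headers, body = send(payload, {"Idempotency-Key": idempotency_key})
        if status != 429:
            return status, body

        retry_after = headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else None

        # Stop when there is no timing hint, no attempts left, or the budget is spent.
        if delay is None or attempt == max_attempts or waited + delay > max_total_wait_s:
            return status, body

        sleep(delay)
        waited += delay

    return status, body
```

The total-wait cap acts as a simple retry budget: once the accumulated `Retry-After` delays would exceed it, the caller gets the 429 back instead of queuing more work against the provider.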

Implementation checklist

How FerryAPI helps

FerryAPI is an OpenAI-compatible AI API gateway for model routing, customer API keys, quotas, prepaid balances, and usage billing. Centralizing rate limit headers in the gateway lets SaaS teams expose predictable developer experience while keeping model spend, retries, and billing records under control.
