Reliability and cost control

OpenAI-Compatible Gateway Retry Budget Policy for SaaS Teams

A practical retry budget policy for SaaS teams running OpenAI-compatible AI gateways: control retries, fallback attempts, streaming timeouts, duplicate billing, and provider spend.

Why retry budgets matter for AI APIs

Retries are useful until they become invisible spend. A normal SaaS API retry may cost a little latency. An AI API retry can trigger another model call, another streaming session, another provider invoice line, and sometimes a different fallback model with a different price. Without a retry budget, one unstable route can multiply cost before anyone notices.

An OpenAI-compatible gateway should treat retries as a policy decision, not a low-level HTTP habit. The gateway already has the context to make that decision: it knows the customer key, tenant budget, model tier, provider route, fallback plan, request size, and whether the first attempt might still complete.

Policy controls to define

Control	Example policy	Why it matters
Maximum attempts	One primary attempt plus one fallback attempt for paid tenants; no fallback for free trials.	Prevents a single request from consuming several provider calls.
Retryable errors	Retry timeouts, 429s with safe backoff, and provider unavailable errors; do not retry auth, quota, content policy, or invalid request errors.	Avoids repeating failures that a second provider call cannot fix.
Streaming timeout rule	Do not retry after tokens have already streamed to the client unless the request is explicitly resumable.	Reduces duplicate answers and duplicate billing disputes.
Cost ceiling	Stop retry/fallback once estimated total request cost exceeds the tenant or feature limit.	Keeps reliability logic aligned with quota and prepaid balance policy.
Idempotency key	Require client idempotency keys for background jobs, agents, and payment-adjacent workflows.	Makes duplicate request detection auditable.
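
The controls in the table can be captured as a small policy object. The following is a minimal sketch, not a reference implementation: the class name `RetryPolicy`, the plan labels, the error category strings, and the dollar ceilings are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical normalized error categories. A real gateway would map each
# provider's error payloads onto labels like these before applying policy.
RETRYABLE = frozenset({"timeout", "rate_limited", "provider_unavailable"})
NON_RETRYABLE = frozenset({"auth", "quota", "content_policy", "invalid_request"})


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int              # total provider calls allowed per request
    allow_fallback: bool           # may the gateway switch provider or model?
    cost_ceiling_usd: float        # stop once estimated total cost exceeds this
    require_idempotency_key: bool  # enforce keys for jobs, agents, payment-adjacent flows


# Illustrative plan table mirroring the example policies above.
# The ceilings are placeholder numbers, not recommendations.
POLICIES = {
    "paid": RetryPolicy(max_attempts=2, allow_fallback=True,
                        cost_ceiling_usd=0.50, require_idempotency_key=True),
    "free_trial": RetryPolicy(max_attempts=1, allow_fallback=False,
                              cost_ceiling_usd=0.05, require_idempotency_key=False),
}


def is_retryable(error_category: str) -> bool:
    """Only known transient categories qualify; unknown labels are not retried."""
    return error_category in RETRYABLE
```

Treating unknown error categories as non-retryable is a deliberate choice in this sketch: a new provider error code then costs nothing extra until someone classifies it.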

Recommended retry decision order

1. Classify the failure. Normalize provider errors before deciding whether another attempt is allowed.
2. Check whether output already reached the user. If streaming began, prefer surfacing a partial-result status instead of silently replaying.
3. Estimate the next attempt cost. Include model price, prompt size, expected output, and any fallback model premium.
4. Check tenant and key budgets. Apply the same rules used by your AI API quota policy.
5. Reserve balance before retrying. Prepaid systems should reserve estimated cost before the second provider call.
6. Record every attempt. The usage ledger should show primary, retry, fallback, refund, and final billable status.
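
The decision order above could be expressed as a single function that runs the checks in sequence and always writes a ledger entry. This is a sketch under stated assumptions: `AttemptContext`, `Decision`, the reason strings, and the list-based ledger are hypothetical, and the cost estimate is assumed to be computed elsewhere.

```python
from dataclasses import dataclass

RETRYABLE = {"timeout", "rate_limited", "provider_unavailable"}  # assumed labels


@dataclass
class AttemptContext:
    error_category: str      # normalized failure label (step 1)
    tokens_streamed: int     # output already sent to the client (step 2)
    resumable: bool          # request explicitly marked resumable
    attempts_made: int
    max_attempts: int
    spent_usd: float         # cost of attempts so far
    next_attempt_usd: float  # estimated cost of the next attempt (step 3)
    cost_ceiling_usd: float  # tenant or feature limit (step 4)
    balance_usd: float       # prepaid balance available to reserve (step 5)


@dataclass
class Decision:
    allowed: bool
    reason: str
    reserved_usd: float = 0.0


def decide_next_attempt(ctx: AttemptContext, ledger: list) -> Decision:
    """Run the checks in order; the first failing check blocks the attempt."""
    if ctx.error_category not in RETRYABLE:
        decision = Decision(False, "non_retryable_error")
    elif ctx.tokens_streamed > 0 and not ctx.resumable:
        decision = Decision(False, "partial_output_already_streamed")
    elif ctx.attempts_made >= ctx.max_attempts:
        decision = Decision(False, "max_attempts_reached")
    elif ctx.spent_usd + ctx.next_attempt_usd > ctx.cost_ceiling_usd:
        decision = Decision(False, "cost_ceiling_exceeded")
    elif ctx.balance_usd < ctx.next_attempt_usd:
        decision = Decision(False, "insufficient_balance")
    else:
        # Reserve the estimate before the provider call is made (step 5).
        decision = Decision(True, "retry_allowed", reserved_usd=ctx.next_attempt_usd)

    # Record every decision, allowed or not (step 6).
    ledger.append({
        "attempt_index": ctx.attempts_made + 1,
        "allowed": decision.allowed,
        "reason": decision.reason,
        "reserved_usd": decision.reserved_usd,
    })
    return decision
```

Because blocked attempts are logged alongside allowed ones, the ledger can later explain why a request was not retried, not only what it cost.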

Separate retry from fallback

A retry uses the same route again because the first attempt probably failed transiently. A fallback changes provider or model because the original path is unhealthy or unavailable. Mixing the two creates confusing bills: teams cannot tell whether spend increased because of network instability, provider outage, or a policy choice to switch models.

Keep both concepts visible in request metadata: attempt_index, retry_reason, fallback_reason, source_model, resolved_model, provider_route, estimated_retry_cost, and final_billable_cost. This complements a broader model routing vs. fallback policy.
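
One way to keep the two concepts apart is to store them as separate fields on a per-attempt record and reject records that set both. The field names below come from the list above; the `AttemptRecord` class, its `kind` and `validate` methods, and the sample model and route names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttemptRecord:
    attempt_index: int
    source_model: str      # model alias the client asked for
    resolved_model: str    # model that actually served this attempt
    provider_route: str
    retry_reason: Optional[str] = None     # set only when the same route is tried again
    fallback_reason: Optional[str] = None  # set only when provider or model changed
    estimated_retry_cost: float = 0.0
    final_billable_cost: float = 0.0

    def kind(self) -> str:
        """Label the attempt so bills can separate retry spend from fallback spend."""
        if self.fallback_reason is not None:
            return "fallback"
        if self.retry_reason is not None:
            return "retry"
        return "primary"

    def validate(self) -> None:
        """An attempt is either a retry or a fallback, never both."""
        if self.retry_reason is not None and self.fallback_reason is not None:
            raise ValueError("attempt cannot be both a retry and a fallback")


# Example: a same-route retry followed by a switch to another route.
retry = AttemptRecord(2, "chat-default", "model-a", "provider-1",
                      retry_reason="timeout", estimated_retry_cost=0.02)
fallback = AttemptRecord(3, "chat-default", "model-b", "provider-2",
                         fallback_reason="provider_unavailable", estimated_retry_cost=0.03)
```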

Operational alerts to add

Fallback ratio by provider, model alias, tenant, and customer API key.
Retry cost as a percentage of total AI API spend.
Requests with streamed partial output followed by a second attempt.
Tenants whose retry spend exceeds the normal baseline for their plan.
Provider routes that trigger repeated timeouts before quota blocks occur.

A simple rule: retries should improve reliability, not hide instability. If retry spend becomes material, treat it as an incident and follow a cost anomaly runbook.
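
Two of the alert metrics listed above can be derived directly from per-attempt ledger rows. The sketch below assumes each row is a dict with hypothetical `kind`, `cost_usd`, and grouping keys such as `provider`; it defines fallback ratio as fallback attempts divided by all attempts in the group, which is one possible definition among several.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def cost_share(attempts: Iterable[dict], kind: str) -> float:
    """Fraction of total spend attributable to attempts of the given kind."""
    rows = list(attempts)
    total = sum(row["cost_usd"] for row in rows)
    if total == 0:
        return 0.0
    matching = sum(row["cost_usd"] for row in rows if row["kind"] == kind)
    return matching / total


def fallback_ratio_by(attempts: Iterable[dict], key: str) -> Dict[str, float]:
    """Fallback attempts divided by all attempts, grouped by the given key."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [fallbacks, total]
    for row in attempts:
        bucket = counts[row[key]]
        bucket[1] += 1
        if row["kind"] == "fallback":
            bucket[0] += 1
    return {group: fallbacks / total for group, (fallbacks, total) in counts.items()}


sample = [
    {"kind": "primary", "cost_usd": 3.0, "provider": "provider-1"},
    {"kind": "retry", "cost_usd": 1.0, "provider": "provider-1"},
    {"kind": "primary", "cost_usd": 2.0, "provider": "provider-2"},
    {"kind": "fallback", "cost_usd": 2.0, "provider": "provider-2"},
]
retry_share = cost_share(sample, "retry")          # 1.0 of 8.0 total spend
ratios = fallback_ratio_by(sample, "provider")
```

An alert would then compare `retry_share` or a per-tenant ratio against the baseline for that plan; the threshold itself is a policy choice outside this sketch.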

For provider-level outages, pair this retry budget with an AI API provider failover runbook so retries, fallback, and provider switching stay inside the same cost policy.

A streaming timeout is one of the easiest ways to accidentally spend retry budget twice; pair this policy with a streaming timeout policy so partial responses, client cancellations, and provider idle windows are handled consistently.

FerryAPI angle: FerryAPI centralizes OpenAI-compatible routing, customer API keys, quotas, prepaid balances, usage records, and provider pools, so retry budgets can be enforced before extra model calls create surprise spend.

For duplicate transport retries, pair this policy with AI API idempotency keys so one customer request maps to one canonical billing event.
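
The idea of one canonical billing event per customer request can be illustrated with a keyed store. This is an in-memory sketch only: `BillingLedger` and its `record` method are hypothetical, and a production gateway would need a durable store with an atomic insert-if-absent operation.

```python
from typing import Dict, Tuple


class BillingLedger:
    """Maps (tenant, idempotency key) to a single billing event."""

    def __init__(self) -> None:
        self._events: Dict[Tuple[str, str], dict] = {}

    def record(self, tenant_id: str, idempotency_key: str, cost_usd: float) -> Tuple[dict, bool]:
        """Return (event, created). A repeated key returns the original event unchanged."""
        key = (tenant_id, idempotency_key)
        if key in self._events:
            return self._events[key], False
        event = {"tenant_id": tenant_id, "idempotency_key": idempotency_key, "cost_usd": cost_usd}
        self._events[key] = event
        return event, True


billing = BillingLedger()
first, created_first = billing.record("tenant-1", "job-42", 0.04)
again, created_again = billing.record("tenant-1", "job-42", 0.04)  # transport retry of the same request
```

Scoping the key by tenant keeps two customers who happen to pick the same key from colliding.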

Client retries should also respect AI API rate limit headers, especially Retry-After, so retry budgets do not amplify throttled traffic.
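
On the client side, the Retry-After header carries either a number of seconds or an HTTP date. The helper below is a minimal sketch using only the Python standard library; the function names and the idea of capping the wait with `max_wait_s` are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After header value; return None if absent or unparseable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())


def should_wait_and_retry(value: Optional[str], max_wait_s: float) -> bool:
    """Retry only when the server's requested delay fits the client's wait budget."""
    delay = retry_after_seconds(value)
    return delay is not None and delay <= max_wait_s
```

If the requested delay exceeds the wait budget, giving up and surfacing the throttling error is usually cheaper than retrying early and being throttled again.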