AI API Streaming Timeout Policy for SaaS Teams
A practical streaming timeout policy for SaaS teams running OpenAI-compatible AI gateways: handle idle streams, cancellation, partial output, retries, billing, and customer-visible errors safely.
Why streaming timeouts deserve their own policy
Streaming AI responses feel simple to users: tokens appear until the answer is done. Operationally, a stream can fail in many partial states: the provider accepted the request but stopped sending chunks, the browser disconnected, the customer canceled the job, a mobile network stalled, or the gateway timed out while the provider kept billing.
A clear timeout policy prevents teams from treating every interrupted stream as a generic retry. In an OpenAI-compatible gateway, timeout handling should connect reliability, customer experience, quota, prepaid balance, and usage ledger behavior.
Timeout types to define
| Timeout | Recommended policy | Billing note |
|---|---|---|
| Connect timeout | Fail fast before provider acceptance; allow one bounded retry or approved fallback route. | Usually no provider usage should be recorded unless an upstream request ID exists. |
| First-token timeout | Retry only for idempotent requests and only within the request retry budget. | Reserve balance before retrying because the provider may still finish the first attempt. |
| Idle chunk timeout | Close the stream after a documented idle window and return a partial-output status. | Record accepted tokens and final provider status when it becomes available. |
| Client cancellation | Propagate cancellation upstream when supported; otherwise mark the request as client-aborted. | Do not silently refund until provider usage is reconciled. |
| Total stream deadline | Set tier-specific maximum duration for agents, long summaries, and code generation. | Use the same canonical request ID for reservation, settlement, and any partial refund. |
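The first-token, idle-chunk, and total-deadline rows in the table can be enforced in one read loop. The following is a minimal sketch using Python's asyncio, not a production gateway: the function name `read_with_timeouts`, its parameters, and the choice to report a missed first token as `provider_timeout` are assumptions made for illustration.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def read_with_timeouts(
    chunks: AsyncIterator[str],
    first_token_timeout: float,
    idle_timeout: float,
    total_deadline: float,
) -> Tuple[List[str], str]:
    """Collect chunks until the stream ends or a timeout fires.

    Returns (chunks_received, close_reason). All durations are in seconds.
    """
    loop = asyncio.get_running_loop()
    deadline_at = loop.time() + total_deadline
    received: List[str] = []
    iterator = chunks.__aiter__()

    while True:
        remaining = deadline_at - loop.time()
        if remaining <= 0:
            return received, "gateway_deadline"

        # Before the first chunk the first-token window applies;
        # afterwards the idle-chunk window applies.
        window = idle_timeout if received else first_token_timeout
        try:
            chunk = await asyncio.wait_for(
                iterator.__anext__(), timeout=min(window, remaining)
            )
        except StopAsyncIteration:
            return received, "completed"
        except asyncio.TimeoutError:
            if remaining <= window:
                # The total deadline was the tighter limit.
                return received, "gateway_deadline"
            # Assumption: a missed first token is logged as provider_timeout.
            reason = "idle_timeout" if received else "provider_timeout"
            return received, reason
        received.append(chunk)
```

The loop only decides when to stop reading and why. Propagating cancellation upstream and settling usage are separate steps that would hang off the returned close reason.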
Gateway fields to log
- stream_started_at, first_token_at, last_chunk_at, and stream_closed_at.
- close_reason: completed, provider_timeout, idle_timeout, client_cancelled, gateway_deadline, or fallback_started.
- canonical_request_id shared across retry, fallback, settlement, and support exports.
- prompt_tokens, completion_tokens_observed, provider_tokens_final, and billable_tokens_final.
- prepaid_reservation_id, reserved_amount, settled_amount, and refund status.
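One way to keep these fields together is a single record per stream that is updated as chunks arrive. The sketch below assumes Unix timestamps as floats and amounts stored as strings. The class name `StreamLogRecord`, its helper methods, and the field name `refund_status` are hypothetical, while the other field names and the close reasons follow the list above.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Close reasons taken from the field list above.
CLOSE_REASONS = frozenset({
    "completed", "provider_timeout", "idle_timeout",
    "client_cancelled", "gateway_deadline", "fallback_started",
})


@dataclass
class StreamLogRecord:
    canonical_request_id: str
    stream_started_at: float
    first_token_at: Optional[float] = None
    last_chunk_at: Optional[float] = None
    stream_closed_at: Optional[float] = None
    close_reason: Optional[str] = None
    prompt_tokens: int = 0
    completion_tokens_observed: int = 0
    provider_tokens_final: Optional[int] = None   # unknown until reconciled
    billable_tokens_final: Optional[int] = None   # unknown until reconciled
    prepaid_reservation_id: Optional[str] = None
    reserved_amount: Optional[str] = None
    settled_amount: Optional[str] = None
    refund_status: Optional[str] = None

    def record_chunk(self, now: float, tokens: int) -> None:
        """Update chunk timestamps and the observed completion token count."""
        if self.first_token_at is None:
            self.first_token_at = now
        self.last_chunk_at = now
        self.completion_tokens_observed += tokens

    def close(self, now: float, reason: str) -> None:
        """Mark the stream closed, rejecting undocumented close reasons."""
        if reason not in CLOSE_REASONS:
            raise ValueError(f"unknown close_reason: {reason}")
        self.stream_closed_at = now
        self.close_reason = reason
```

Calling `asdict(record)` yields a flat dictionary that can be written to whatever log sink the gateway already uses.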
Safe retry and fallback behavior
Streaming timeouts should reuse the same retry budget discipline as non-streaming calls, but with stricter idempotency rules. A customer may have already seen partial text, run a tool call, or queued a workflow step. Retrying blindly can duplicate actions, confuse the user, and double provider spend.
When the timeout is provider-side and the workload is safe to retry, retry within a bounded gateway retry budget. When the primary provider is unhealthy, fail over only along routes approved in the provider failover runbook. For long-form tasks where exact output matters, prefer returning a partial result plus a continuation option instead of silently switching models mid-answer.
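The rules in this section can be written as a small decision function. This is a sketch under stated assumptions rather than a complete policy: `StreamAttempt`, `next_action`, and the action labels are hypothetical names, and the sketch takes the conservative position that any delivered chunk or started tool call rules out an automatic retry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamAttempt:
    close_reason: str        # e.g. "provider_timeout" or "idle_timeout"
    chunks_delivered: int    # chunks already forwarded to the customer
    tool_call_started: bool  # a tool call or workflow step was triggered
    idempotent: bool         # the request is safe to replay
    retries_used: int        # retries already spent on this request


def next_action(attempt: StreamAttempt, retry_budget: int, primary_healthy: bool) -> str:
    """Return one of: stop, return_partial, fail, failover, retry."""
    if attempt.close_reason == "client_cancelled":
        return "stop"
    # Visible output or side effects: hand back the partial result with a
    # continuation option rather than regenerating or switching models.
    if attempt.chunks_delivered > 0 or attempt.tool_call_started:
        return "return_partial"
    if not attempt.idempotent or attempt.retries_used >= retry_budget:
        return "fail"
    # Assumption: "failover" means an approved fallback route exists.
    if not primary_healthy:
        return "failover"
    return "retry"
```

For example, an idempotent request that timed out before any chunk was delivered, with budget remaining and a healthy primary, resolves to `retry`. The same request after two delivered chunks resolves to `return_partial`.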
Customer-visible response contract
Do not expose internal provider noise to the user, but do expose enough state for applications to recover. A useful response includes whether output is complete, whether retry is safe, the canonical request ID, and whether billing is final or pending reconciliation.
{
  "status": "partial",
  "reason": "idle_timeout",
  "retry_safe": false,
  "request_id": "req_...",
  "billing_status": "pending_provider_reconciliation"
}
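A gateway could assemble this payload from the close reason and settlement state. The helper below is a hypothetical sketch: `build_stream_status` is an invented name, and the values `"complete"` and `"final"` are assumed counterparts to the `"partial"` and `"pending_provider_reconciliation"` values shown above.

```python
import json
from typing import Any, Dict


def build_stream_status(
    close_reason: str,
    request_id: str,
    retry_safe: bool,
    billing_final: bool,
) -> Dict[str, Any]:
    """Build the customer-visible status object for a closed stream."""
    return {
        # Assumption: any close reason other than "completed" means partial output.
        "status": "complete" if close_reason == "completed" else "partial",
        "reason": close_reason,
        "retry_safe": retry_safe,
        "request_id": request_id,
        # Assumption: "final" is the settled counterpart of the pending state.
        "billing_status": "final" if billing_final else "pending_provider_reconciliation",
    }


if __name__ == "__main__":
    payload = build_stream_status("idle_timeout", "req_demo", False, False)
    print(json.dumps(payload, indent=2))
```

Keeping the builder in one place means every timeout path returns the same shape, so client applications only need one recovery branch.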
Billing reconciliation after stream interruption
The most expensive mistake is refunding or rebilling based only on the chunks the client observed, because the provider may report a different final token count after a stalled stream. Pair streaming rules with an explicit AI API refund policy for failed requests. Keep the request in a pending settlement state until the gateway usage ledger and the provider invoice can be matched; a sound usage ledger design and a multi-provider reconciliation process make that matching possible.
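The pending-settlement rule can be sketched as a function that refuses to compute a refund until the provider's final token count is known. The names `settle_reservation` and `price_per_token`, the flat per-token price, and the `"final"` status value are simplifying assumptions, and real pricing usually distinguishes prompt and completion tokens.

```python
from decimal import Decimal
from typing import Dict, Optional


def settle_reservation(
    reserved_amount: Decimal,
    provider_tokens_final: Optional[int],
    price_per_token: Decimal,
) -> Dict[str, object]:
    """Settle a prepaid reservation once provider usage has been reconciled."""
    if provider_tokens_final is None:
        # Provider usage is not reconciled yet: no refund, no rebilling.
        return {
            "billing_status": "pending_provider_reconciliation",
            "settled_amount": None,
            "refund_amount": None,
            "shortfall_amount": None,
        }

    cost = price_per_token * provider_tokens_final
    settled = min(cost, reserved_amount)
    return {
        "billing_status": "final",  # assumed name for the settled state
        "settled_amount": settled,
        "refund_amount": reserved_amount - settled,
        # Usage above the reservation, to be handled by a separate policy.
        "shortfall_amount": cost - settled,
    }
```

With a reservation of 1.00 and 400 final tokens at 0.001 each, this settles 0.400 and refunds 0.600. With no final count, it stays pending regardless of how many chunks the client saw.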
How FerryAPI helps
FerryAPI is an OpenAI-compatible AI API gateway for teams that need practical model routing, customer API keys, quota policy, prepaid balance controls, and billing-ready usage records across provider routes. It helps SaaS teams make streaming failures predictable instead of leaving timeout behavior scattered across SDKs and front-end code.
Use FerryAPI to centralize OpenAI-compatible streaming policy, customer quotas, prepaid balances, retries, failover, and usage logs.
When clients reconnect after an interrupted stream, AI API idempotency keys help the gateway resume or report the original canonical request instead of starting duplicate generations.
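A minimal sketch of that lookup, assuming an in-memory map from idempotency key to canonical request ID. The class `IdempotencyRegistry` and its `resolve` method are hypothetical, and a real gateway would need a shared, expiring store rather than a process-local dictionary.

```python
import uuid
from typing import Dict, Tuple


class IdempotencyRegistry:
    """Map client idempotency keys to canonical request IDs (in-memory sketch)."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    def resolve(self, idempotency_key: str) -> Tuple[str, bool]:
        """Return (canonical_request_id, is_new) for the given key."""
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            # Reconnect: report the original request instead of regenerating.
            return existing, False
        request_id = f"req_{uuid.uuid4().hex}"
        self._by_key[idempotency_key] = request_id
        return request_id, True
```

A reconnecting client that resends the same key gets the original request ID back with `is_new` set to false, which tells the gateway to resume or report status rather than start a second generation.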