When Your LLM API Costs Spike: An Incident Checklist for SaaS Teams
A practical incident checklist for SaaS teams handling sudden LLM API cost spikes: blast-radius triage, scoped spend controls, retry policy, fallback routing, attribution, and post-incident guardrails.
Phase 1: Confirm the blast radius
Start with questions that identify ownership and urgency.
- Which endpoint, app, or workflow changed first?
- Is the spike concentrated in one customer, one feature, or one model?
- Is usage driven by new traffic, retries, longer prompts, or fallback routing?
- Are errors increasing, or is the system succeeding expensively?
- Did a deploy, prompt edit, agent tool change, or provider incident happen recently?
Minimum metadata to inspect:
- tenant or customer id;
- app / environment;
- feature or workflow name;
- request id / trace id;
- selected model;
- prompt and completion token counts;
- retry count;
- fallback path;
- approximate cost.
If you cannot answer these quickly, the incident is also an attribution incident.
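One way to answer the triage questions is to sum approximate cost per metadata field and see which dimension dominates. The sketch below is illustrative only: the record shape, the model names, and the per-1K-token prices are assumptions, not real provider rates.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices in USD. Replace with your provider's real rates.
PRICE_PER_1K = {
    "large-model": {"prompt": 0.01, "completion": 0.03},
    "small-model": {"prompt": 0.001, "completion": 0.002},
}


def approx_cost(record):
    """Estimate the cost of one request from its prompt and completion token counts."""
    price = PRICE_PER_1K[record["model"]]
    return (
        record["prompt_tokens"] / 1000 * price["prompt"]
        + record["completion_tokens"] / 1000 * price["completion"]
    )


def cost_by(records, dimension):
    """Sum approximate cost per value of one metadata field, highest spender first."""
    totals = defaultdict(float)
    for record in records:
        totals[record.get(dimension, "unknown")] += approx_cost(record)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Example usage records (assumed log shape).
records = [
    {"tenant_id": "customer_123", "feature": "ticket-summary", "model": "large-model",
     "prompt_tokens": 4000, "completion_tokens": 1000, "retry_count": 2},
    {"tenant_id": "customer_456", "feature": "tagging", "model": "small-model",
     "prompt_tokens": 1000, "completion_tokens": 500, "retry_count": 0},
]

for dimension in ("tenant_id", "feature", "model"):
    print(dimension, cost_by(records, dimension))
```

Records that lack a field are grouped under `unknown`. A large `unknown` bucket is itself a sign of the attribution gap described above.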
---
Phase 2: Stop runaway spend without breaking everything
Prefer scoped controls over global shutdowns.
Good first controls
- Set a temporary hard quota on the affected tenant or workflow.
- Disable the specific agent/tool path that is looping.
- Cap max output tokens for the affected route.
- Reduce retry count for non-critical traffic.
- Route low-priority traffic to a lower-cost model.
- Keep critical production flows on the most reliable path.
Risky controls
- Switching every request to a cheaper model without quality checks.
- Disabling all fallbacks during a provider incident.
- Letting retry middleware and agent retry logic both run independently, so that attempts multiply instead of adding up.
- Treating all 5xx/provider failures as retryable forever.
A useful rule: customer-visible critical paths deserve reliability; background automation deserves budgets.
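A scoped control can be as small as a lookup keyed by tenant and workflow that leaves all other traffic untouched. This is a minimal sketch under stated assumptions: `ScopedLimit`, `LIMITS`, and `apply_limits` are hypothetical names, the budget figures are invented, and spend is tracked in memory rather than in a shared store.

```python
from dataclasses import dataclass


@dataclass
class ScopedLimit:
    """Temporary incident limit for one tenant/workflow pair (hypothetical shape)."""
    budget_usd: float
    max_output_tokens: int
    spent_usd: float = 0.0


# Keyed by (tenant_id, workflow). Anything not listed here is out of scope.
LIMITS = {
    ("customer_123", "agent-triage-v2"): ScopedLimit(budget_usd=50.0, max_output_tokens=512),
}


def apply_limits(tenant_id, workflow, requested_max_tokens, estimated_cost_usd):
    """Return (allowed, max_tokens) for a request, restricting only scoped traffic."""
    limit = LIMITS.get((tenant_id, workflow))
    if limit is None:
        # Unaffected traffic passes through unchanged.
        return True, requested_max_tokens
    if limit.spent_usd + estimated_cost_usd > limit.budget_usd:
        # Hard quota reached for this tenant/workflow only.
        return False, 0
    limit.spent_usd += estimated_cost_usd
    # Cap output tokens for the affected route.
    return True, min(requested_max_tokens, limit.max_output_tokens)
```

In a real deployment the spend counter would need to live somewhere shared across processes. The point of the sketch is only that the quota and the token cap apply to one tenant/workflow pair, not globally.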
---
Phase 3: Separate retryable failures from non-retryable failures
Not every failed LLM call should be retried.
Retryable examples:
- transient provider 5xx;
- temporary network timeout;
- provider rate limit (for example HTTP 429), if budget and user experience allow the wait;
- short-lived queue pressure.
Usually not retryable:
- invalid API key;
- malformed request;
- context length exceeded;
- policy or permission error;
- deterministic prompt/tool bug;
- repeated tool-call loop with no new state.
For agent workflows, add a stricter rule: if the next retry does not have new information, it is probably not a retry; it is a loop.
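The two rules above can be sketched as a retry classifier plus a loop guard. The error category names are hypothetical labels that you would map your provider's real errors onto. The loop guard assumes agent state is JSON-serializable, and unknown errors are treated as non-retryable by choice.

```python
import hashlib
import json

# Hypothetical error categories, not any provider's actual error codes.
RETRYABLE = {"provider_5xx", "network_timeout", "rate_limited", "queue_pressure"}
NON_RETRYABLE = {
    "invalid_api_key", "malformed_request", "context_length_exceeded",
    "permission_denied", "deterministic_bug",
}


def should_retry(error_kind, attempt, max_attempts=3):
    """Retry only known-transient errors, and only up to a bounded attempt count."""
    if error_kind in NON_RETRYABLE:
        return False
    # Unknown error kinds are not retried in this sketch.
    return error_kind in RETRYABLE and attempt < max_attempts


class LoopGuard:
    """Flags an agent step whose state was already seen: no new information, so a loop."""

    def __init__(self):
        self._seen = set()

    def is_loop(self, state):
        # Fingerprint the state with stable key ordering so equal states hash equally.
        encoded = json.dumps(state, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(encoded).hexdigest()
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False


guard = LoopGuard()
step = {"tool": "search_tickets", "args": {"query": "refund"}}
print(guard.is_loop(step))  # first time this state is seen
print(guard.is_loop(step))  # identical state again
```

If both middleware and agent code retry, it helps to route both through one function like `should_retry` so the attempt limit is counted once.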
---
Phase 4: Inspect fallback routing
Fallback is valuable, but it can create cost surprises.
Check:
- Which model was primary?
- Which model did traffic fall back to?
- Was the fallback more expensive?
- Did the fallback preserve required quality?
- Did fallback trigger for all traffic or only critical traffic?
- Did the route fail back to the primary model after the provider recovered?
A simple fallback matrix:
| Traffic class | Example | Fallback behavior |
|---|---|---|
| Critical user-facing | paid user support response | retry briefly, then fall back to a reliable model |
| Important but async | report generation | queue, retry later, fall back only if budget allows |
| Low-priority batch | enrichment, tagging | pause or use cheaper model |
| Experimental agent | internal prototype | hard cap, no unlimited fallback |
The matrix should be decided before the incident, not during it.
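Writing the matrix down as code is one way to make sure it exists before the incident. The sketch below is one possible reading of the table, not a definitive policy: the action strings, the `MAX_BRIEF_RETRIES` value, and the choice to give experimental agents no fallback at all are assumptions.

```python
from enum import Enum


class TrafficClass(Enum):
    CRITICAL_USER_FACING = "critical_user_facing"
    IMPORTANT_ASYNC = "important_async"
    LOW_PRIORITY_BATCH = "low_priority_batch"
    EXPERIMENTAL_AGENT = "experimental_agent"


MAX_BRIEF_RETRIES = 2  # assumed meaning of "retry briefly"


def fallback_action(traffic_class, primary_healthy, retries_used, budget_remaining_usd):
    """Pick an action for one request according to the fallback matrix."""
    if primary_healthy:
        # Fail back to the primary as soon as it has recovered.
        return "use_primary"
    if traffic_class is TrafficClass.CRITICAL_USER_FACING:
        if retries_used < MAX_BRIEF_RETRIES:
            return "retry_primary"
        return "fall_back_to_reliable_model"
    if traffic_class is TrafficClass.IMPORTANT_ASYNC:
        if budget_remaining_usd > 0:
            return "fall_back_to_reliable_model"
        return "queue_and_retry_later"
    if traffic_class is TrafficClass.LOW_PRIORITY_BATCH:
        if budget_remaining_usd > 0:
            return "use_cheaper_model"
        return "pause"
    # Experimental agents: hard cap, no fallback in this sketch.
    return "stop_at_hard_cap"


print(fallback_action(TrafficClass.IMPORTANT_ASYNC, False, 0, 0.0))
```

Keeping the policy in one function also makes it easy to check whether a fallback fired for all traffic or only for the classes that were supposed to get it.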
---
Phase 5: Add attribution before reopening the floodgates
Before removing temporary caps, make sure future usage can be traced.
Every production LLM request should carry enough metadata to answer:
- who caused this cost;
- what product feature caused it;
- which model path was used;
- whether retry/fallback happened;
- whether the request was user-facing or background work.
Suggested metadata fields:
{
  "tenant_id": "customer_123",
  "app": "support-copilot",
  "environment": "production",
  "feature": "ticket-summary",
  "workflow": "agent-triage-v2",
  "request_class": "user_facing",
  "trace_id": "..."
}
This is where an OpenAI-compatible gateway helps: apps can keep the same SDK shape while routing, budget enforcement, key scoping, and logging happen at the access layer.
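To keep untagged requests from reaching production, the metadata can be validated before each call and passed along as request headers. This is a sketch under assumptions: the `X-Meta-` header prefix and the `metadata_headers` helper are hypothetical, and how a particular gateway expects to receive metadata should be checked in its own documentation.

```python
import uuid

# Field names taken from the suggested metadata above.
REQUIRED_FIELDS = (
    "tenant_id", "app", "environment", "feature",
    "workflow", "request_class", "trace_id",
)


def metadata_headers(metadata):
    """Validate attribution metadata and turn it into HTTP headers.

    The "X-Meta-" prefix is a made-up convention for illustration.
    """
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError("missing attribution metadata: " + ", ".join(missing))
    return {
        "X-Meta-" + field.replace("_", "-"): str(metadata[field])
        for field in REQUIRED_FIELDS
    }


headers = metadata_headers({
    "tenant_id": "customer_123",
    "app": "support-copilot",
    "environment": "production",
    "feature": "ticket-summary",
    "workflow": "agent-triage-v2",
    "request_class": "user_facing",
    "trace_id": str(uuid.uuid4()),  # generated per request
})
print(headers["X-Meta-tenant-id"])
```

Failing fast on missing fields is a deliberate choice here: a rejected request in staging is cheaper than an unattributable cost line in production.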
---
Phase 6: Write the post-incident policy
Close the loop with policy, not just a dashboard screenshot.
Document:
- Trigger condition: what alerted you?
- Root cause: traffic, prompt, retry, model, provider, or deploy?
- Customer impact: latency, quality, errors, or only cost?
- Temporary controls applied.
- Permanent controls added.
- Owner for each affected workflow.
- New budget/quota limits.
- New fallback behavior.
- Metadata gaps discovered.
A good outcome is not just lower spend. A good outcome is that the next spike can be explained within five minutes.
---
Practical gateway-level controls
For SaaS teams running multiple apps or customers, the access layer should support:
- OpenAI-compatible endpoint shape;
- separate API keys per app/customer/workflow;
- per-key and per-route quotas;
- request metadata logging;
- model routing rules;
- fallback policies;
- provider-level and tenant-level usage views;
- fast key rotation;
- environment separation;
- simple export for finance/product review.
FerryAPI fits this pattern as a low-cost OpenAI-compatible AI API gateway: route usage through a single gateway, scope keys by workload, and keep spend attributable while preserving familiar API integration patterns.
Docs: https://www.ferryapi.io/docs?utm_source=content&utm_medium=checklist&utm_campaign=7day_growth