When Your LLM API Costs Spike: An Incident Checklist for SaaS Teams

A practical incident checklist for SaaS teams handling sudden LLM API cost spikes: blast-radius triage, scoped spend controls, retry policy, fallback routing, attribution, and post-incident guardrails.

Phase 1: Confirm the blast radius

Start with questions that identify ownership and urgency.

Which endpoint, app, or workflow changed first?
Is the spike concentrated in one customer, one feature, or one model?
Is usage driven by new traffic, retries, longer prompts, or fallback routing?
Are errors increasing, or is the system succeeding expensively?
Did a deploy, prompt edit, agent tool change, or provider incident happen recently?

Minimum metadata to inspect:

tenant or customer id;
app / environment;
feature or workflow name;
request id / trace id;
selected model;
prompt and completion token counts;
retry count;
fallback path;
approximate cost.

If you cannot answer these quickly, the incident is also an attribution incident.
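As a rough illustration of the triage questions above, the sketch below groups logged requests by one metadata field at a time and ranks the totals. The record layout, the field names (`cost_usd`, `retry_count`), and the sample values are assumptions made for illustration, not any provider's real log format.

```python
from collections import defaultdict

# Hypothetical request log entries; field names and values are illustrative only.
REQUEST_LOG = [
    {"tenant_id": "customer_123", "feature": "ticket-summary", "model": "model-a",
     "retry_count": 0, "cost_usd": 0.02},
    {"tenant_id": "customer_456", "feature": "agent-triage", "model": "model-b",
     "retry_count": 4, "cost_usd": 0.90},
    {"tenant_id": "customer_456", "feature": "agent-triage", "model": "model-b",
     "retry_count": 5, "cost_usd": 1.10},
    {"tenant_id": "customer_789", "feature": "ticket-summary", "model": "model-a",
     "retry_count": 0, "cost_usd": 0.03},
]


def spend_by(records, field):
    """Sum approximate cost per value of `field`, most expensive first."""
    totals = defaultdict(float)
    for record in records:
        # Requests missing the field are grouped so attribution gaps stay visible.
        totals[record.get(field, "unattributed")] += record.get("cost_usd", 0.0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def retried_share(records):
    """Fraction of total cost that came from requests with at least one retry."""
    total = sum(r.get("cost_usd", 0.0) for r in records)
    retried = sum(r.get("cost_usd", 0.0) for r in records if r.get("retry_count", 0) > 0)
    return retried / total if total else 0.0


if __name__ == "__main__":
    for field in ("tenant_id", "feature", "model"):
        print(field, spend_by(REQUEST_LOG, field)[:3])
    print("share of cost from retried requests:", round(retried_share(REQUEST_LOG), 2))
```

If one tenant, feature, or model dominates the ranking, that is the blast radius. A large "unattributed" bucket is the attribution gap the checklist warns about.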

---

Phase 2: Stop runaway spend without breaking everything

Prefer scoped controls over global shutdowns.

Good first controls

Set a temporary hard quota on the affected tenant or workflow.
Disable the specific agent/tool path that is looping.
Cap max output tokens for the affected route.
Reduce retry count for non-critical traffic.
Route low-priority traffic to a lower-cost model.
Keep critical production flows on the most reliable path.

Risky controls

Switching every request to a cheaper model without quality checks.
Disabling all fallbacks during a provider incident.
Letting retry middleware and agent retry logic both run independently, so attempts multiply (3 attempts at each layer can become 9 provider calls per request).
Treating all 5xx/provider failures as retryable forever.

A useful rule: customer-visible critical paths deserve reliability; background automation deserves budgets.
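A minimal sketch of a scoped control, assuming an in-process guard that runs before each provider call. The `ScopedLimit` type, the `admit` function, the `(tenant, workflow)` scope key, and the dollar and token figures are all hypothetical. A real deployment would usually enforce this at a gateway with shared state rather than in one process.

```python
from dataclasses import dataclass


@dataclass
class ScopedLimit:
    budget_usd: float        # temporary hard quota for this scope
    max_output_tokens: int   # output cap applied to every request in scope
    spent_usd: float = 0.0


class BudgetExceeded(Exception):
    """Raised when a scoped quota would be exceeded by the next request."""


# Only the affected tenant/workflow is limited; everything else is untouched.
LIMITS = {
    ("customer_123", "agent-triage-v2"): ScopedLimit(budget_usd=50.0, max_output_tokens=512),
}


def admit(tenant_id, workflow, requested_max_tokens, estimated_cost_usd):
    """Return the max output tokens to use, or raise if the scope is over budget."""
    limit = LIMITS.get((tenant_id, workflow))
    if limit is None:
        # Out-of-scope traffic passes through unchanged.
        return requested_max_tokens
    if limit.spent_usd + estimated_cost_usd > limit.budget_usd:
        raise BudgetExceeded(f"{tenant_id}/{workflow} is over its temporary quota")
    limit.spent_usd += estimated_cost_usd
    return min(requested_max_tokens, limit.max_output_tokens)
```

Because the limit is keyed by tenant and workflow, the looping path is capped while other customers and critical flows keep their normal behavior.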

---

Phase 3: Separate retryable failures from non-retryable failures

Not every failed LLM call should be retried.

Retryable examples:

transient provider 5xx;
temporary network timeout;
provider rate limit if budget and user experience allow it;
short-lived queue pressure.

Usually not retryable:

invalid API key;
malformed request;
context length exceeded;
policy or permission error;
deterministic prompt/tool bug;
repeated tool-call loop with no new state.

For agent workflows, add a stricter rule: if the next retry does not have new information, it is probably not a retry; it is a loop.
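The split above can be sketched as a small decision function plus a loop check. The status-code mapping is an assumption based on common HTTP conventions (providers differ, so check the actual error codes you receive). `should_retry`, `is_loop`, and the `status=None` convention for network timeouts are hypothetical names chosen for this example.

```python
import hashlib
import json

# Assumed mapping; real providers may signal these conditions differently.
TRANSIENT_STATUS = {500, 502, 503, 504}
RATE_LIMIT_STATUS = 429


def should_retry(status, attempt, max_attempts=3, budget_allows=True):
    """Decide whether one more attempt is justified.

    status: HTTP status code, or None for a network timeout.
    attempt: number of attempts already made for this request.
    """
    if attempt >= max_attempts:
        return False  # never retry forever, even for transient errors
    if status is None or status in TRANSIENT_STATUS:
        return True
    if status == RATE_LIMIT_STATUS:
        return budget_allows
    # Auth, validation, context-length, and permission errors will fail again.
    return False


def fingerprint(state):
    """Stable hash of the agent state that would be sent on the next call."""
    encoded = json.dumps(state, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def is_loop(previous_state, next_state):
    """True when the next attempt carries no new information."""
    return fingerprint(previous_state) == fingerprint(next_state)
```

An agent runner could call `is_loop` before each repeated tool call and stop when it returns true, rather than relying on the retry budget alone.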

---

Phase 4: Inspect fallback routing

Fallback is valuable, but it can create cost surprises.

Check:

Which model was primary?
Which model did traffic fall back to?
Was the fallback more expensive?
Did the fallback preserve required quality?
Did fallback trigger for all traffic or only critical traffic?
Did the route fail back to the primary model after the provider recovered?

A simple fallback matrix:

| Traffic class | Example | Fallback behavior |
| --- | --- | --- |
| Critical user-facing | paid user support response | retry briefly, then fall back to a reliable model |
| Important but async | report generation | queue, retry later, fall back only if budget allows |
| Low-priority batch | enrichment, tagging | pause or use a cheaper model |
| Experimental agent | internal prototype | hard cap, no unlimited fallback |

The matrix should be decided before the incident, not during it.
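One way to decide the matrix ahead of time is to encode it as a function that the routing layer consults after each failure. This is a sketch of one possible reading of the matrix above. The class names, action strings, retry counts, and the `next_action` signature are assumptions, not a prescribed policy.

```python
def next_action(traffic_class, attempt, budget_remaining_usd):
    """Pick the next step after a failed call.

    attempt: number of retries already made for this request (0 on first failure).
    budget_remaining_usd: remaining budget for this workload.
    """
    has_budget = budget_remaining_usd > 0

    if traffic_class == "critical_user_facing":
        # Retry briefly, then move to the reliable model regardless of budget.
        return "retry" if attempt < 2 else "fallback_reliable_model"

    if traffic_class == "important_async":
        # Queue and retry later; fall back only if budget allows.
        if attempt < 3:
            return "queue_and_retry_later"
        return "fallback_model" if has_budget else "fail"

    if traffic_class == "low_priority_batch":
        # Pause, or use a cheaper model while budget remains.
        return "use_cheaper_model" if has_budget else "pause"

    if traffic_class == "experimental_agent":
        # Hard cap: at most one fallback, and none once the budget is gone.
        return "fallback_once" if has_budget and attempt == 0 else "stop"

    # Unknown classes get the most conservative treatment.
    return "stop"
```

Defaulting unknown classes to `stop` means a new workflow has to be classified explicitly before it can trigger fallback spend.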

---

Phase 5: Add attribution before reopening the floodgates

Before removing temporary caps, make sure future usage can be traced.

Every production LLM request should carry enough metadata to answer:

who caused this cost;
what product feature caused it;
which model path was used;
whether retry/fallback happened;
whether the request was user-facing or background work.

Suggested metadata fields:

{
  "tenant_id": "customer_123",
  "app": "support-copilot",
  "environment": "production",
  "feature": "ticket-summary",
  "workflow": "agent-triage-v2",
  "request_class": "user_facing",
  "trace_id": "..."
}

This is where an OpenAI-compatible gateway helps: apps keep the same SDK shape while routing, budgets, key scoping, and logging are handled at the access layer.
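Before lifting temporary caps, a check like the following can reject or flag requests that arrive without the suggested metadata. It is a sketch only. The required-field list mirrors the fields suggested above, while the `background` request class, the helper names, and the UUID-based trace id are assumptions.

```python
import uuid

REQUIRED_FIELDS = (
    "tenant_id", "app", "environment", "feature",
    "workflow", "request_class", "trace_id",
)
# "user_facing" comes from the suggested fields; "background" is an assumed counterpart.
REQUEST_CLASSES = {"user_facing", "background"}


def with_trace_id(metadata):
    """Return a copy of the metadata with a trace_id filled in if absent."""
    tagged = dict(metadata)
    tagged.setdefault("trace_id", uuid.uuid4().hex)
    return tagged


def validate(metadata):
    """Return a list of attribution problems; an empty list means the request is traceable."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not metadata.get(name)]
    request_class = metadata.get("request_class")
    if request_class and request_class not in REQUEST_CLASSES:
        problems.append(f"unknown request_class {request_class!r}")
    return problems
```

Running `validate` in a logging-only mode first shows which apps would fail the check before it starts blocking traffic.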

---

Phase 6: Write the post-incident policy

Close the loop with policy, not just a dashboard screenshot.

Document:

Trigger condition: what alerted you?
Root cause: traffic, prompt, retry, model, provider, or deploy?
Customer impact: latency, quality, errors, or only cost?
Temporary controls applied.
Permanent controls added.
Owner for each affected workflow.
New budget/quota limits.
New fallback behavior.
Metadata gaps discovered.

A good outcome is not just lower spend. A good outcome is that the next spike can be explained within five minutes.

---

Practical gateway-level controls

For SaaS teams running multiple apps or customers, the access layer should support:

OpenAI-compatible endpoint shape;
separate API keys per app/customer/workflow;
per-key and per-route quotas;
request metadata logging;
model routing rules;
fallback policies;
provider-level and tenant-level usage views;
fast key rotation;
environment separation;
simple export for finance/product review.

FerryAPI fits this pattern as a low-cost OpenAI-compatible AI API gateway: route usage through a single gateway, scope keys by workload, and keep spend attributable while preserving familiar API integration patterns.

Docs: https://www.ferryapi.io/docs?utm_source=content&utm_medium=checklist&utm_campaign=7day_growth
