LLM Prepaid Balance Implementation for SaaS AI Products
A practical implementation guide for SaaS teams adding LLM prepaid balances, reservations, quota checks, settlement, refunds, and invoice-ready usage records.
Why prepaid balance logic belongs near the gateway
AI SaaS products often start with one provider account and one monthly bill. That works until customers need usage-based pricing, agents can spend money in the background, and finance needs to know whether a request should be accepted before it reaches a model provider. A prepaid balance system should sit close to the OpenAI-compatible gateway so every request can be checked, reserved, routed, and settled with the same policy.
The implementation goal has two parts: never let an LLM request that is not tied to a tenant create surprise provider spend, and never make the billing ledger depend on provider invoices alone.
Core data model
| Object | Required fields | Implementation note |
|---|---|---|
| Tenant balance | tenant_id, currency, available_balance, reserved_balance, credit_limit, status | Keep available and reserved amounts separate so long-running requests do not double spend. |
| Price table | provider, model, input_token_price, output_token_price, cached_token_price, effective_at | Version prices so old usage can be reconciled even after provider pricing changes. |
| Reservation | reservation_id, request_id, tenant_id, estimated_cost, expires_at, status | Expire abandoned reservations and release balance automatically. |
| Usage event | request_id, api_key_id, feature, model, tokens, provider_cost, customer_cost, policy_version | Make the usage event invoice-ready, not just observability metadata. |
| Ledger entry | entry_id, tenant_id, amount, type, source_id, created_at | Use append-only entries for top-ups, reservations, settlements, refunds, adjustments, and credits. |
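As a rough illustration of the table above, the tenant balance and the append-only ledger could be modeled as follows. This is a minimal in-memory sketch, not a storage design: the field names come from the table, while the `Ledger` class, its `append` and `balance` methods, the signed-amount convention, and the example `type` values are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import List
import uuid


@dataclass
class TenantBalance:
    tenant_id: str
    currency: str
    available_balance: Decimal
    reserved_balance: Decimal = Decimal("0")  # kept separate from available
    credit_limit: Decimal = Decimal("0")
    status: str = "active"


@dataclass(frozen=True)  # frozen: an entry cannot be modified after creation
class LedgerEntry:
    entry_id: str
    tenant_id: str
    amount: Decimal  # assumption: positive credits the tenant, negative debits
    type: str        # assumption: e.g. "top_up", "settlement", "refund"
    source_id: str
    created_at: datetime


class Ledger:
    """Hypothetical append-only ledger: entries are added, never edited."""

    def __init__(self) -> None:
        self._entries: List[LedgerEntry] = []

    def append(self, tenant_id: str, amount: Decimal,
               entry_type: str, source_id: str) -> LedgerEntry:
        entry = LedgerEntry(
            entry_id=str(uuid.uuid4()),
            tenant_id=tenant_id,
            amount=amount,
            type=entry_type,
            source_id=source_id,
            created_at=datetime.now(timezone.utc),
        )
        self._entries.append(entry)
        return entry

    def balance(self, tenant_id: str) -> Decimal:
        # The balance is derived by summing entries, so a correction is a
        # new adjustment entry rather than an in-place edit.
        return sum(
            (e.amount for e in self._entries if e.tenant_id == tenant_id),
            Decimal("0"),
        )
```

Using `Decimal` rather than floats avoids rounding drift when many small per-request amounts are summed.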
Request lifecycle
- Identify the tenant: resolve the customer API key before provider routing. Reject unowned or suspended keys early.
- Estimate spend: use model, prompt tokens, max output tokens, cached-token policy, and route-specific markup to estimate a worst-case cost.
- Reserve balance: atomically move the estimate from available balance to reserved balance. If the available balance cannot cover the estimate, apply the tenant's over-limit policy instead.
- Route the model call: send the request to the approved provider/model only after balance and quota checks pass.
- Settle actual usage: calculate real input/output/cached token cost, release unused reservation, and append a usage ledger entry.
- Export billing data: group usage by tenant, feature, model, and invoice period for dashboards and finance reconciliation.
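The estimate, reserve, and settle steps above can be sketched as follows. This is a simplified in-memory version under stated assumptions: prices are expressed per one million tokens, cached tokens are priced no higher than regular input tokens (so treating every prompt token as uncached is the worst case), and the names `Price`, `Balance`, `token_cost`, `estimate_worst_case`, `reserve`, and `settle` are hypothetical. A production gateway would perform the reserve step inside a database transaction or a conditional update so that it is actually atomic.

```python
from dataclasses import dataclass
from decimal import Decimal

PER_MILLION = Decimal("1000000")  # assumption: prices are per 1M tokens


@dataclass
class Price:
    input_token_price: Decimal
    output_token_price: Decimal
    cached_token_price: Decimal


@dataclass
class Balance:
    available: Decimal
    reserved: Decimal = Decimal("0")


class InsufficientBalance(Exception):
    """Raised when the available balance cannot cover the estimate."""


def token_cost(price: Price, input_tokens: int, output_tokens: int,
               cached_tokens: int = 0,
               markup: Decimal = Decimal("1")) -> Decimal:
    raw = (price.input_token_price * input_tokens
           + price.output_token_price * output_tokens
           + price.cached_token_price * cached_tokens) / PER_MILLION
    return raw * markup


def estimate_worst_case(price: Price, prompt_tokens: int,
                        max_output_tokens: int,
                        markup: Decimal = Decimal("1")) -> Decimal:
    # Worst case: nothing is served from cache and the model emits the
    # full max_output_tokens.
    return token_cost(price, prompt_tokens, max_output_tokens, 0, markup)


def reserve(balance: Balance, estimate: Decimal) -> None:
    # Not atomic here; a real system needs a transaction around this.
    if balance.available < estimate:
        raise InsufficientBalance(f"need {estimate}, have {balance.available}")
    balance.available -= estimate
    balance.reserved += estimate


def settle(balance: Balance, estimate: Decimal, actual: Decimal) -> Decimal:
    # Release the whole reservation, then charge only the actual cost.
    balance.reserved -= estimate
    balance.available += estimate - actual
    return estimate - actual  # unused reservation returned to the tenant
```

If the actual cost ever exceeds the estimate, `settle` returns a negative number and the available balance drops by the difference, which is a signal that the estimator is not truly worst-case.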
Over-limit policies to define up front
- Hard stop: reject with a clear billing error before provider spend occurs.
- Soft grace: allow a small negative balance for trusted paid plans, then notify admins.
- Model downgrade: route to a cheaper model only when the feature can tolerate quality differences.
- Queue for top-up: hold background jobs until a balance webhook or manual top-up arrives.
- Admin override: require an auditable policy version and expiry time, not an informal database edit.
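One way to keep these policies explicit is a single decision function that the gateway calls when a reservation cannot be covered. The sketch below is illustrative only: the `Policy` enum, the `Decision` shape, the `decide` function, and parameters such as `grace_limit` and `cheaper_estimate` are assumptions, and the admin override is left out because it depends on how policy versions and expiry times are stored.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Optional


class Policy(Enum):
    HARD_STOP = "hard_stop"
    SOFT_GRACE = "soft_grace"
    MODEL_DOWNGRADE = "model_downgrade"
    QUEUE_FOR_TOP_UP = "queue_for_top_up"


@dataclass
class Decision:
    action: str                  # "allow", "downgrade", "queue", or "reject"
    model: Optional[str] = None  # model to route to, when a call proceeds
    notify_admins: bool = False


def decide(policy: Policy, available: Decimal, estimate: Decimal, model: str,
           grace_limit: Decimal = Decimal("0"),
           cheaper_model: Optional[str] = None,
           cheaper_estimate: Optional[Decimal] = None) -> Decision:
    # Enough balance: no over-limit policy is needed.
    if available >= estimate:
        return Decision("allow", model)

    # Soft grace: allow a bounded negative balance and notify admins.
    if policy is Policy.SOFT_GRACE and available - estimate >= -grace_limit:
        return Decision("allow", model, notify_admins=True)

    # Model downgrade: only if the cheaper route fits the available balance.
    if (policy is Policy.MODEL_DOWNGRADE
            and cheaper_model is not None
            and cheaper_estimate is not None
            and available >= cheaper_estimate):
        return Decision("downgrade", cheaper_model)

    # Queue for top-up: hold the job instead of failing it.
    if policy is Policy.QUEUE_FOR_TOP_UP:
        return Decision("queue")

    # Hard stop, and the fallback for every policy that could not apply.
    return Decision("reject")
```

Falling through to `reject` means an exhausted grace limit or an unaffordable downgrade never silently turns into provider spend.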
Failure modes that cause billing drift
| Failure mode | Prevention |
|---|---|
| Provider timeout after tokens were generated | Record provider request ids and reconcile against provider usage exports. |
| Retry counted as two customer requests | Attach idempotency keys and settlement state to the gateway request id. |
| Price table changes mid-period | Store the price version in effect (for example, the price table row's effective_at) on every usage event and ledger entry. |
| Fallback uses a more expensive model | Require policy approval before fallback can exceed the reserved cost envelope. |
| Streaming response disconnects early | Settle from provider-reported actual usage, not client-visible completion length alone. |
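The retry row in particular lends itself to a short example. The sketch below keys settlement on the gateway request id so that a retried or replayed settlement does not write a second ledger entry. The `SettlementStore` class and its `settle_once` method are hypothetical, and a dictionary stands in for what would normally be a unique constraint on `request_id` in the database.

```python
from decimal import Decimal
from typing import Dict, List, Tuple


class SettlementStore:
    """Hypothetical in-memory store; a real one would be a database table."""

    def __init__(self) -> None:
        self._settled: Dict[str, Decimal] = {}
        self.ledger: List[Tuple[str, Decimal]] = []

    def settle_once(self, request_id: str,
                    customer_cost: Decimal) -> Tuple[Decimal, bool]:
        """Return (charged_cost, newly_settled) for this request id."""
        if request_id in self._settled:
            # Replay: return the original charge and write nothing new.
            return self._settled[request_id], False
        self._settled[request_id] = customer_cost
        self.ledger.append((request_id, -customer_cost))  # debit entry
        return customer_cost, True
```

Returning the originally charged amount on a replay, rather than the amount passed in the second call, keeps the ledger as the single source of truth even if the retry reports different usage.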
Operational metrics
- Reservation denial rate by tenant, feature, and model.
- Estimated cost vs actual cost variance.
- Expired reservation amount and count.
- Provider invoice total vs gateway ledger total by billing period.
- Top customers approaching balance, quota, or credit-limit thresholds.
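Several of these metrics reduce to small calculations over data the gateway already holds. The helpers below are a sketch under assumptions: the function names, the `Reservation` shape, and the `"active"` status value are illustrative, and variance is defined here as the share of the estimate that went unused.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Iterable, Tuple


@dataclass
class Reservation:
    estimated_cost: Decimal
    expires_at: datetime
    status: str  # assumption: "active" until settled or released


def denial_rate(denied: int, total: int) -> Decimal:
    # Share of reservation attempts that were refused.
    return Decimal(denied) / Decimal(total) if total else Decimal("0")


def estimate_variance(estimated: Decimal, actual: Decimal) -> Decimal:
    # Fraction of the estimate that was not used; negative values mean the
    # estimate was too low.
    if estimated == 0:
        return Decimal("0")
    return (estimated - actual) / estimated


def expired_reservations(reservations: Iterable[Reservation],
                         now: datetime) -> Tuple[int, Decimal]:
    # Count and total amount of reservations past expiry but still active.
    expired = [r for r in reservations
               if r.status == "active" and r.expires_at <= now]
    total = sum((r.estimated_cost for r in expired), Decimal("0"))
    return len(expired), total


def ledger_drift(provider_invoice_total: Decimal,
                 ledger_provider_cost_total: Decimal) -> Decimal:
    # Positive drift: the provider billed more than the gateway recorded.
    return provider_invoice_total - ledger_provider_cost_total
```

Tracking these per tenant, feature, and model, as the list suggests, is a matter of grouping the inputs before calling the helpers.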
Where FerryAPI fits
FerryAPI is an OpenAI-compatible API gateway for teams that need model routing, customer API keys, quota controls, and usage billing across providers. Related implementation guides: AI API usage attribution schema, tenant-level budget guardrails, AI API refund policy for failed requests, and OpenRouter alternative migration plan.
Use FerryAPI to keep OpenAI-compatible requests tied to customer keys, quota policy, provider routing, and invoice-ready usage records.
For concrete policy templates, see AI API quota policy examples.