Caching, batch & routing

This is how Relay gives you the lowest effective cost — not just a low sticker price. All three are on by default; you can tune or disable them per request.

Prompt caching

Repeated prefixes (long system prompts, RAG context, few-shot examples) are cached. Cache reads are billed at roughly 10% of the input rate — up to a 90% saving on the cached portion.

{ "model": "deepseek-v4-flash", "messages": [...], "cache": true }

Set "cache": false to opt out for a request.

Batch lane

For work where no one is waiting (evals, labeling, nightly jobs, RAG pre-processing), submit to the batch lane and pay ~50% less. Results return asynchronously.

POST https://api.relay.com/v1/batches
{ "model": "qwen3-235b", "requests": [ ... ] }

Stack batch + caching on shared-prefix workloads and effective cost can drop to ~25% of the on-demand rate.

Cheapest-first routing

The same open model is served by multiple providers at different prices. We route each request to the cheapest healthy provider that meets your latency target, and report which one served it in x_relay.provider.

{ "model": "kimi-k2-6", "messages": [...], "route": "cheapest" }   // default
{ "model": "kimi-k2-6", "messages": [...], "route": "fastest" }    // optimize latency
{ "model": "kimi-k2-6", "messages": [...], "route": "DeepInfra" }  // pin a provider

Smart cascading (optional)

Route easy requests to a smaller, cheaper model and escalate only hard ones. Ask us to enable cascading policies for your account.

Seeing your savings

Every response includes x_relay.saved_usd, and your dashboard shows a running "saved this month" total. Estimate before you switch with the cost calculator.

← Chat completions Next: Errors & rate limits →