Caching, batch & routing
This is how Relay gives you the lowest effective cost, not just a low sticker price. All three (prompt caching, the batch lane, and cheapest-first routing) are on by default; you can tune or disable them per request.
Prompt caching
Repeated prefixes (long system prompts, RAG context, few-shot examples) are cached. Cache reads are billed at roughly 10% of the input rate — up to a 90% saving on the cached portion.
{ "model": "deepseek-v4-flash", "messages": [...], "cache": true }
Caching is on by default, so "cache": true is optional. Set "cache": false to opt out for a single request.
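To see what the cache discount means for a bill, the arithmetic can be sketched in Python. The 10% read rate comes from the figure above. The price of $0.50 per million input tokens and the token counts are made-up numbers for illustration, not Relay prices, and the sketch assumes there is no extra charge for writing to the cache.

```python
def effective_input_cost(total_tokens, cached_tokens, price_per_mtok,
                         cache_read_ratio=0.10):
    """Input cost in USD when cached_tokens of the prompt are read from cache.

    cache_read_ratio=0.10 mirrors the "roughly 10% of the input rate" figure.
    price_per_mtok is a hypothetical price per million input tokens.
    """
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    uncached = total_tokens - cached_tokens
    # Cached tokens are billed at a fraction of the normal input rate.
    billable = uncached + cached_tokens * cache_read_ratio
    return billable * price_per_mtok / 1_000_000


# Hypothetical request: 10,000 input tokens, 8,000 of them a cached system prompt.
full = effective_input_cost(10_000, 0, price_per_mtok=0.50)
cached = effective_input_cost(10_000, 8_000, price_per_mtok=0.50)
saving = 1 - cached / full  # fraction of the input bill saved
```

With 80% of the prompt cached, the input bill in this example drops by about 72%. The saving approaches 90% only as the cached share approaches the whole prompt.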
Batch lane
For work where no one is waiting (evals, labeling, nightly jobs, RAG pre-processing), submit to the batch lane and pay ~50% less. Results return asynchronously.
POST https://api.relay.com/v1/batches
{ "model": "qwen3-235b", "requests": [ ... ] }
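A batch submission could be assembled from Python roughly as follows. This is a minimal sketch using only the standard library: the endpoint and the "model" and "requests" fields come from the example above, while the Bearer authorization header is an assumption. The shape of each entry in "requests" is not specified here, so the helper passes entries through unchanged. It builds the request without sending it.

```python
import json
import urllib.request

BATCH_URL = "https://api.relay.com/v1/batches"  # endpoint from the example above


def build_batch_request(model, requests, api_key=None):
    """Build (but do not send) a POST request for the batch lane.

    `requests` is passed through as-is; the per-entry shape is not
    defined in this section. The Bearer header is an assumed auth scheme.
    """
    body = json.dumps({"model": model, "requests": list(requests)}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        headers["Authorization"] = f"Bearer {api_key}"  # assumption
    return urllib.request.Request(BATCH_URL, data=body, headers=headers, method="POST")


# Placeholder entries only; substitute real request objects.
req = build_batch_request("qwen3-235b", [{"placeholder": 1}, {"placeholder": 2}])
# To submit, pass `req` to urllib.request.urlopen(). How results are
# retrieved afterwards is not covered by this sketch.
```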
Cheapest-first routing
The same open model is served by multiple providers at different prices. We route each request to the cheapest healthy provider that meets your latency target, and report which one served it in x_relay.provider.
{ "model": "kimi-k2-6", "messages": [...], "route": "cheapest" } // default
{ "model": "kimi-k2-6", "messages": [...], "route": "fastest" } // optimize latency
{ "model": "kimi-k2-6", "messages": [...], "route": "DeepInfra" } // pin a provider
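The "cheapest healthy provider that meets your latency target" rule can be illustrated with a short selection function. This is a sketch of the idea only, not Relay's actual router: the Provider fields, the provider names, the prices, and the latency figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provider:
    name: str
    price_per_mtok: float  # USD per million tokens (hypothetical)
    latency_ms: float      # typical recent latency (hypothetical metric)
    healthy: bool


def pick_cheapest(providers: list[Provider],
                  max_latency_ms: float) -> Optional[Provider]:
    """Return the cheapest healthy provider within the latency target, or None."""
    eligible = [p for p in providers
                if p.healthy and p.latency_ms <= max_latency_ms]
    return min(eligible, key=lambda p: p.price_per_mtok, default=None)


providers = [
    Provider("provider-a", 0.40, 900, True),   # cheapest, but too slow
    Provider("provider-b", 0.55, 300, True),
    Provider("provider-c", 0.45, 250, False),  # cheaper than b, but unhealthy
    Provider("provider-d", 0.70, 120, True),
]
choice = pick_cheapest(providers, max_latency_ms=500)
```

Filtering on health and latency before comparing prices means a cheaper provider is skipped whenever it is unhealthy or too slow, which is why provider-b is chosen here rather than provider-a or provider-c.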
Smart cascading (optional)
Route easy requests to a smaller, cheaper model and escalate only hard ones. Ask us to enable cascading policies for your account.
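The cascading idea itself is simple to sketch. The following is a conceptual illustration, not Relay's cascading policy: the model names, the `ask` callable, and the `is_good_enough` check are hypothetical stand-ins, and a real policy would need a better signal for when a request is hard.

```python
from typing import Callable


def cascade(prompt: str,
            ask: Callable[[str, str], str],
            is_good_enough: Callable[[str], bool],
            small: str = "small-model",
            large: str = "large-model") -> tuple[str, str]:
    """Try the small model first; escalate to the large one only if needed.

    Returns (model_used, answer). All names here are hypothetical.
    """
    answer = ask(small, prompt)
    if is_good_enough(answer):
        return small, answer
    return large, ask(large, prompt)


# Stand-in for a model call: the "small" model gives up on long prompts.
def fake_ask(model: str, prompt: str) -> str:
    if model == "small-model" and len(prompt) > 20:
        return ""
    return f"{model} answered"


easy = cascade("2 + 2?", fake_ask, is_good_enough=bool)
hard = cascade("Prove this long and tricky statement.", fake_ask, is_good_enough=bool)
```

Only the hard prompt pays for the large model. The easy one is answered by the small model alone.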
Seeing your savings
Every response includes x_relay.saved_usd, and your dashboard shows a running "saved this month" total. To estimate your savings before you switch, use the cost calculator.
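To track savings in your own logs, the per-response figure can be summed client-side. This sketch assumes each parsed response body is a dict with a top-level "x_relay" object holding "saved_usd"; that placement is an assumption, and the sample values are made up.

```python
def total_saved_usd(responses):
    """Sum x_relay.saved_usd across parsed response bodies.

    Assumes a top-level "x_relay" object in each response dict;
    responses without the field count as zero.
    """
    return sum(r.get("x_relay", {}).get("saved_usd", 0.0) for r in responses)


# Hand-written sample responses (values are illustrative only).
responses = [
    {"x_relay": {"provider": "provider-a", "saved_usd": 0.0012}},
    {"x_relay": {"provider": "provider-b", "saved_usd": 0.0030}},
    {},  # a response without the field is tolerated
]
total = total_saved_usd(responses)
```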