If you are serving an LLM-backed feature and your inference bill is climbing faster than your usage, there is a good chance (we would put it around 70%) that the cause is uniform model selection: every request hits the same model regardless of what the request actually needs.

Right-sizing LLM inference: a three-tier approach
Production context from the Cloudico engineering notebook.

Tier 1: classification

Before you hit any expensive model, classify the request. “What is the user trying to do?” can almost always be answered by a fine-tuned small model (DistilBERT, Llama 3 8B) at <$0.0001 per request.

Tier 2: cheap-and-fast

For ~70% of classified requests, a small model will do: a 7B-13B open-weight model such as Mistral 7B or Llama 3 8B, or a small hosted model such as Claude Haiku. You’re paying roughly 1/10th the cost of frontier models for roughly 90% of the quality on routine tasks.

Tier 3: frontier

For the ~30% of requests that genuinely need it (complex reasoning, multi-step planning, code generation in production-critical paths), route to a GPT-4-class model or Claude Opus. You’re now paying frontier prices only on the requests that justify them, not on “is this email spam?”.
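The three tiers above can be sketched as a small routing function. This is a minimal illustration, not the production router: the intent labels, the placeholder model names, the keyword_classifier stand-in, and the min_confidence threshold are all hypothetical. In particular, escalating low-confidence requests to the frontier tier is an assumption we add here, on the theory that a misrouted hard request costs more than an over-served easy one.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical intent labels; a real deployment would define its own taxonomy.
FRONTIER_INTENTS = {"complex_reasoning", "multi_step_planning", "critical_codegen"}


@dataclass(frozen=True)
class Route:
    tier: str
    model: str


# Placeholder model names, not real model identifiers.
CHEAP = Route(tier="cheap", model="small-model-placeholder")
FRONTIER = Route(tier="frontier", model="frontier-model-placeholder")


def keyword_classifier(text: str) -> Tuple[str, float]:
    """Stand-in for the Tier 1 classifier: returns (intent, confidence).

    A real system would call a fine-tuned small model here instead.
    """
    lowered = text.lower()
    if "plan" in lowered or "step by step" in lowered:
        return "multi_step_planning", 0.9
    if "spam" in lowered:
        return "spam_check", 0.95
    return "unknown", 0.3


def route(
    text: str,
    classify: Callable[[str], Tuple[str, float]] = keyword_classifier,
    min_confidence: float = 0.6,
) -> Route:
    """Pick a tier from the classifier's intent and confidence."""
    intent, confidence = classify(text)
    # Assumption: when the classifier is unsure, escalate rather than
    # risk a poor answer from the cheap tier.
    if confidence < min_confidence or intent in FRONTIER_INTENTS:
        return FRONTIER
    return CHEAP
```

Calling route("Is this email spam?") returns the cheap route, while route("Plan a migration step by step") returns the frontier one. Passing the classifier in as a parameter keeps the routing policy testable without loading a model.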

How much this saves

In one engagement we cut an inference bill from $87k/mo to $34k/mo. Of that 61% reduction, three-tier routing accounted for ~42 percentage points. Reservation strategy, batching, and prompt compression accounted for the remaining ~19.
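The arithmetic behind those figures can be checked in a few lines. This sketch assumes the percentage points are measured against the original $87k/mo bill, which is our reading of the numbers rather than something the engagement data spells out. The function name savings_breakdown is just for illustration.

```python
def savings_breakdown(before_usd: float, after_usd: float, routing_points: float) -> dict:
    """Split a bill reduction into the routing share and everything else.

    routing_points is in percentage points of the original bill (assumed).
    """
    saved_usd = before_usd - after_usd
    saved_pct = 100 * saved_usd / before_usd
    routing_usd = before_usd * routing_points / 100
    return {
        "saved_usd": saved_usd,
        "saved_pct": saved_pct,
        "routing_usd": routing_usd,
        "other_usd": saved_usd - routing_usd,
        "other_points": saved_pct - routing_points,
    }


result = savings_breakdown(87_000, 34_000, 42)
for key, value in result.items():
    print(f"{key}: {value:,.1f}")
```

Under that reading, routing is worth about $36.5k/mo of the $53k/mo saved, and the other three techniques together cover the remaining roughly $16.5k/mo.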

The operating test

We treat this as real only when it changes a dashboard, a runbook, and one named engineer’s weekly work. If the idea cannot survive those three places, it is probably just a slide.

The useful version is specific, measurable, and owned by someone who can say what changed after it shipped.

What we would do differently

  • Instrument before changing architecture. The baseline decides whether the fix worked.
  • Name the trade-off. Every improvement costs latency, money, complexity, or time somewhere else.
  • Revisit it after 30 days. Production has a way of teaching what the workshop missed.
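
As one way to capture the baseline the first point asks for, the sketch below aggregates per-request cost by tier, so the same summary can be compared before and after a routing change. The RequestRecord fields and the summarize_by_tier helper are hypothetical; real numbers would come from your provider's usage logs or billing export.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class RequestRecord:
    # Hypothetical log schema; adapt the fields to whatever you already record.
    tier: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def summarize_by_tier(records: Iterable[RequestRecord]) -> Dict[str, dict]:
    """Return request count, token totals, and total cost for each tier."""
    totals: Dict[str, dict] = defaultdict(
        lambda: {"requests": 0, "tokens": 0, "cost_usd": 0.0}
    )
    for record in records:
        bucket = totals[record.tier]
        bucket["requests"] += 1
        bucket["tokens"] += record.input_tokens + record.output_tokens
        bucket["cost_usd"] += record.cost_usd
    return dict(totals)
```

Before any routing exists, every record carries the same tier, which is exactly the uniform-selection baseline the later numbers get compared against.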