The first call started with a number: $87,400. That was the previous month’s AWS Bedrock bill at a Series B AI startup. The month before had been $54k. Six months earlier, it was $3,900. User growth across the same period: 1.8×.

The CFO had two questions. What’s going wrong? And when will it stop?
Why the curve isn’t linear
Most teams budget LLM inference as cost-per-user × users. That’s wrong. Three things compound on top of base usage:
- Context window growth. As products mature, prompts get longer. We measured a 3.4× average prompt length increase over 9 months in one codebase.
- Retry storms. When latency spikes, retries spike. Bedrock charges for the retry tokens.
- Eval costs. Production traffic generates eval traffic. If you’re shadow-testing a new model, you’re paying for both.
The 9-month migration window
From the time you notice the slope to the time a self-hosted plan is in production, plan for nine months. That’s not pessimism — it’s what every team we’ve done this with has actually experienced. Trying to do it in three months means cutting corners that cost more later.
What we got wrong on the vLLM cutover
Week three of our migration, we discovered the evaluation harness their team built was inadequate to compare model variants under production load patterns. We had to write one from scratch, which pushed the right-sizing work into the final week. The savings still held. But the project ran 8 days over schedule and we ate the cost. Honest version: that was a scoping error on our part.
Three numbers to track
If you’re at the start of this curve, instrument these now:
- Cost per active user, per model. Not aggregate. By model.
- Tokens per request, p95. The p95, not the average, predicts your future bill.
- Retry rate per endpoint. Spikes here forecast cost spikes within hours.
The operating test
We treat this as real only when it changes a dashboard, a runbook, and one named engineer’s weekly work. If the idea cannot survive those three places, it is probably just a slide.
The useful version is specific, measurable, and owned by someone who can say what changed after it shipped.
What we would do differently
- Instrument before changing architecture. The baseline decides whether the fix worked.
- Name the trade-off. Every improvement costs latency, money, complexity, or time somewhere else.
- Revisit it after 30 days. Production has a way of teaching what the workshop missed.