AI Infrastructure

How AI Infra Gets Built (Badly)

It starts with a notebook that worked once.

Almost every AI infrastructure problem we audit started the same way: a researcher built a working POC, the team shipped it, and then production traffic showed up. Here’s the curve, in three stages:

01 The POC works

The research win.

A Jupyter notebook with a wrapped Hugging Face model serves the first 100 customers. Everyone celebrates. The “infrastructure” is one EC2 instance, a Flask wrapper, and a senior ML engineer keeping it alive on weekends.

100 customers · 1 GPU · ~$1.2k/mo

Month 1–3

02 Real traffic

The latency wall.

Customer count crosses 1,000. p99 latency creeps past 3 seconds. Your team Band-Aids it with bigger GPUs — spend doubles. Then doubles again. The ML engineer is now full-time on infra. Half the team migrates to Bedrock to “make it someone else’s problem.”

1k+ customers · p99 > 3s · ~$12k/mo

Month 4–8

03 Bill shock

The vendor bill shock.

Monthly Bedrock or Vertex bill crosses $80k. CFO asks “do we even need this LLM?” Engineering says yes. Finance says prove it. Nobody has the eval data to defend the spend. The cost curve is now a board-level conversation. The team is in the worst position to make a calm decision.

Bedrock dependency · No eval data · $80k+/mo

Month 9+

What’s Broken

Five patterns we find in every AI stack we audit.

These aren’t speculative. They’re the failure modes we see in 9 of every 10 production AI stacks we audit, including teams running on Bedrock, Vertex AI, and SageMaker.

01

Your GPUs run at 14% utilization.

A standard Hugging Face pipeline() call keeps your A100 or H100 mostly idle under real traffic. It processes one request at a time while everything else queues. The fix isn’t buying bigger GPUs; it’s switching from naive serving to vLLM with continuous batching.

Typical waste

86%

of GPU capacity paid for but idle on the default serving stack

02

Bedrock bills compound quietly.

Cost per million tokens looks fine at low volume. At scale it eats your margin. Most teams discover this when the monthly bill crosses $80k and finance asks for a justification. By then, the obvious savings move (self-host) feels too risky to attempt mid-quarter.

Hidden cost

3–5×

what self-hosted vLLM would cost on the same model and traffic shape

03

p99 latency kills user trust.

The median request feels fast. The slow 1% feels broken. Naive serving doesn’t separate time-to-first-token from generation throughput, so optimization targets get mixed up. Result: streaming chat that pauses, agent workflows that time out, and customer churn the dashboard doesn’t predict.

Typical p99

~3.2s

vs ~480ms on a tuned vLLM stack with the same model
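
To make the split between time-to-first-token and generation throughput concrete, here is a minimal Python sketch that derives both from per-token timestamps. The `RequestTrace` record and its field names are assumptions for illustration, not part of any serving stack's API.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical record shape)."""
    sent_at: float
    token_times: List[float]  # arrival time of each generated token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: how long the user waits before seeing anything."""
    return trace.token_times[0] - trace.sent_at


def decode_tokens_per_second(trace: RequestTrace) -> float:
    """Generation throughput, measured only after the first token arrives."""
    span = trace.token_times[-1] - trace.token_times[0]
    if span <= 0:
        return 0.0
    return (len(trace.token_times) - 1) / span


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; simple enough for a dashboard sketch."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    traces = [
        RequestTrace(0.0, [0.4, 0.5, 0.6, 0.7]),
        RequestTrace(0.0, [2.9, 3.0, 3.1, 3.2]),  # slow to start, normal decode
    ]
    print("p99 TTFT:", percentile([ttft(t) for t in traces], 99))
    print("tokens/s:", [round(decode_tokens_per_second(t), 1) for t in traces])
```

Tracking the two numbers separately shows whether a slow request was slow to start or slow to stream, which points at different fixes.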

04

Vendor lock-in creeps in.

You build on Bedrock Knowledge Bases, then Bedrock Agents, then Bedrock Guardrails. Six months later, switching means rewriting your application layer. The cloud vendor’s pricing power grows every quarter you stay. Most teams don’t realize they’re locked in until they try to leave.

Switching cost

~6 mo

of engineering time to leave once you’re past the integration honeymoon

05

No eval harness — vibes-only.

When you change the prompt, the model, or the temperature, nobody can prove it got better. Your team merges based on “feels okay in my testing.” Regressions ship. Customer reports come in. AI without evals is software development without tests — you can ship like that for a while, then it catches up.

Ship confidence

low

every model swap or prompt change becomes a Big Risky Decision

Three Paths To Production AI

Bedrock, SageMaker, or own the stack.

Most teams default to the managed services because self-hosting feels scary. Here’s the honest comparison — including where managed actually beats self-hosted.

Fully managed, pay per token.

Bedrock and Vertex AI take infrastructure off your plate completely. The trade-off is that cost scales linearly (and then non-linearly) with usage, and the vendor’s product roadmap dictates what you can use. Great for POCs and low-volume production.

Zero infrastructure to manage — truly
Frontier models (Claude, Gemini) built-in
Fast to start — great for POCs
Cost compounds non-linearly with usage
Lock-in via Agents, Knowledge Bases, Guardrails

Real cost at scale

$4–6

per 1M tokens on Llama-3 70B-equivalent workloads. Premium frontier models like Claude Sonnet run higher.

Best for: POCs, low volume, frontier-only use cases

You pick the model. Cloud picks everything else.

SageMaker and Vertex endpoints let you bring your own model but still tie you to one cloud’s GPU pricing and a default serving container that rarely uses vLLM. Cheaper than Bedrock but the middle ground often gives you the worst of both options.

You pick the model and the GPU shape
Some autoscaling built-in (basic)
Cheaper than Bedrock at high volume
Default container rarely uses vLLM — low GPU util
Still locked to one cloud’s GPU pricing

Real cost at scale

$2–3

per 1M tokens on a 70B model. Better than Bedrock, but you’re paying premium for a default serving stack that wastes most of the GPU.

Best for: teams not ready for self-hosted yet

Multi-cloud, engineer-tuned, no lock-in.

vLLM with PagedAttention plus Triton routing on your Kubernetes cluster. Runs on AWS, GCP, Azure, CoreWeave, or RunPod. Model-agnostic: swap Llama for Mistral or Qwen in hours, not quarters. This is what Stripe, Cohere, Meta, and Mistral run in production.

PagedAttention · 80%+ GPU utilization achievable
Continuous batching · 2–24x throughput uplift
Multi-cloud · AWS, GCP, Azure, CoreWeave, RunPod
Model-agnostic — swap models in hours, not quarters
Stripe migrated to this stack and cut their cost by 73%

Real cost on tuned vLLM

$0.80–1.50

per 1M tokens on Llama-3 70B with tuned vLLM on H100 SXM. Hand-tuned configs hit the lower bound under steady load.

Best for: any team past POC scale — the default we recommend

What We Ship

Six capability blocks. Production-grade ML infrastructure.

Every AI Infrastructure engagement covers the same six areas. Depth varies with scope; nothing on this list is optional in a production deployment.

The output

An ML platform your team actually owns.

We don’t sell you a tool subscription or wrap your model in our proprietary serving layer. We ship open-source infrastructure (vLLM, Triton, KServe, MLflow) in your repo, configured for your traffic shape and model mix. Your engineers run it after we leave.

vLLM · Triton · TensorRT-LLM · KServe · MLflow · Karpenter · Prometheus

73% cost cut avg

Optimized inference serving

vLLM with PagedAttention and continuous batching for chat and streaming. TensorRT-LLM compiled engines for max-throughput batch workloads. Triton on the front for mixed-model routing. Tuned per your traffic shape — not generic defaults.

vLLM · TensorRT-LLM · Triton

GPU-aware Kubernetes

Karpenter or Cluster Autoscaler tuned for GPU node groups. NVIDIA device plugin, MIG partitioning where useful, spot GPU sourcing on AWS and GCP. Node-level scheduling that keeps utilization above 75% without over-provisioning.

Karpenter · NVIDIA plugin · MIG

Fine-tuning pipeline

LoRA and QLoRA pipelines on Hugging Face or Unsloth, hyperparameter tuning, base model selection. We ship the pipeline as IaC so your team can re-run it on new data without rebuilding the harness. Includes MLflow tracking for every run.

LoRA / QLoRA · Unsloth · MLflow

Eval harness

An eval pipeline using your golden datasets, LLM-as-judge where appropriate, and regression detection on every PR. Promptfoo or DeepEval as the framework. We deliver the harness AND the first 100 eval cases pulled from your real traffic.

Promptfoo · DeepEval · CI gates

ML observability

Per-request trace logging, TTFT and throughput histograms, GPU utilization by model and tenant, cost-per-1M-tokens dashboards. Built on Prometheus + Grafana — or your existing Datadog if that’s where the team lives.

Prometheus · OpenTelemetry · Per-tenant cost

Cost guardrails

Per-tenant token budgets, rate limits on expensive routes, automatic fall-through to cheaper models when budgets approach. Cost anomaly detection on every account. Monthly written cost review so the LLM line item never surprises your CFO again.

Per-tenant budgets · Model routing · Anomaly alerts
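
The budget and fall-through behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the `TenantBudget` class, the 80% soft threshold, and the model names `llama-3-70b` and `llama-3-8b` are hypothetical placeholders, not a shipped configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TenantBudget:
    """Monthly token allowance for one tenant (hypothetical shape)."""
    monthly_token_limit: int
    tokens_used: int = 0


def choose_model(
    budget: TenantBudget,
    requested_tokens: int,
    primary: str = "llama-3-70b",   # assumed name for the expensive route
    fallback: str = "llama-3-8b",   # assumed name for the cheaper route
    soft_threshold: float = 0.8,    # assumed point where fall-through starts
) -> Optional[str]:
    """Pick a model for the request, or None if the hard budget would be exceeded."""
    projected = budget.tokens_used + requested_tokens
    if projected > budget.monthly_token_limit:
        return None  # hard stop: caller should reject or queue the request
    if projected / budget.monthly_token_limit >= soft_threshold:
        return fallback  # budget is approaching its limit, use the cheaper model
    return primary


def record_usage(budget: TenantBudget, tokens: int) -> None:
    """Account for tokens actually consumed after the request completes."""
    budget.tokens_used += tokens


if __name__ == "__main__":
    tenant = TenantBudget(monthly_token_limit=1_000_000)
    for used in (100_000, 790_000, 990_000):
        tenant.tokens_used = used
        print(used, "->", choose_model(tenant, requested_tokens=20_000))
```

Returning `None` instead of raising keeps the rejection policy (error, queue, or degrade) in the caller's hands.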

The Optimization Stack

Each layer compounds the savings.

This is the actual cost-per-1M-token math we walk clients through. Same model (Llama-3 70B), same H100 fleet, different software stack and config. Costs are typical — your numbers will vary by workload shape.

Baseline: HF pipeline()

Default Hugging Face on bare GPU

Where most teams start

$4.20

+ INT8 quantization

VRAM footprint cut ~50%

Quick win, minimal accuracy loss

$3.05

+ vLLM PagedAttention

KV cache fragmentation eliminated

2–24x throughput uplift

$1.98

+ Continuous batching

No idle GPU between requests

Saturates GPU under traffic

$1.39

+ Speculative decoding

Draft model accelerates target

Big latency win + cost cut

$1.18

Final: tuned vLLM stack

What we hand you

Production-tuned

$1.01

Cost per 1M output tokens · Llama-3 70B · H100 SXM

−76% total reduction · same hardware
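
The arithmetic behind the stack can be checked with a short Python sketch. It uses only the per-layer costs listed above; the function names are illustrative.

```python
from typing import List, Tuple

# Cost per 1M output tokens after each layer, as listed above.
STACK: List[Tuple[str, float]] = [
    ("Baseline: HF pipeline()", 4.20),
    ("+ INT8 quantization", 3.05),
    ("+ vLLM PagedAttention", 1.98),
    ("+ Continuous batching", 1.39),
    ("+ Speculative decoding", 1.18),
    ("Final: tuned vLLM stack", 1.01),
]


def step_reductions(layers: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Percent saved by each layer, relative to the cost just before it."""
    results = []
    previous = layers[0][1]
    for name, cost in layers[1:]:
        results.append((name, round((previous - cost) / previous * 100, 1)))
        previous = cost
    return results


def total_reduction(layers: List[Tuple[str, float]]) -> float:
    """Percent saved from the first layer to the last."""
    first, last = layers[0][1], layers[-1][1]
    return round((first - last) / first * 100, 1)


if __name__ == "__main__":
    for name, pct in step_reductions(STACK):
        print(f"{name}: -{pct}% vs previous layer")
    print(f"Total: -{total_reduction(STACK)}%")
```

Each layer removes a share of whatever cost remains after the previous one, which is why the per-layer percentages multiply rather than add up to the total.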

What You Walk Away With

Numbers your Head of AI can defend to the board.

Averaged across our last 11 AI Infrastructure engagements. Your numbers will vary by workload shape, model choice, and traffic profile — but the direction is consistent.

−0%

Median inference cost reduction

vs prior managed-service spend

0x

Throughput uplift on same GPU fleet

vLLM + tuned batching

0ms

Median TTFT after tuning

Down from 740ms pre-engagement

0%

Average GPU utilization achieved

vs 14% on naive serving

How It Runs

Three phases. One named senior ML engineer from day one.

Each engagement runs 10–14 weeks depending on model count and traffic complexity. A senior ML platform engineer is named on day one and stays on your Slack and your calls the entire time.

i

Weeks 1–2 · baseline & audit

Inference baseline

Read-only audit of your current AI stack: models in use, hosting (Bedrock / Vertex / SageMaker / self-hosted), traffic patterns, p99 latency, GPU utilization, cost per million tokens. We benchmark your current setup so the “after” numbers have a defensible “before.”

You’ll have

Inference cost-per-token baseline
GPU utilization audit
Top 10 optimization opportunities
Locked scope & price for build phase

ii

Weeks 3–11 · build & migrate

Build the self-hosted stack

Deploy vLLM or Triton in your Kubernetes cluster. Wire up GPU autoscaling. Migrate traffic gradually using a router that splits between old and new. Run shadow traffic for eval-set validation. Ship the fine-tuning pipeline and the eval harness. Cut over service-by-service — never a flag day.

You’ll have

vLLM / Triton in production
GPU-aware K8s autoscaling
Fine-tuning pipeline & eval harness
Cost guardrails + observability
Migration completed service-by-service
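
One way to sketch the gradual traffic split is a deterministic hash on a request or tenant ID, so the same ID always lands on the same stack as the percentage ramps up. The function name, the `old`/`new` labels, and the choice of SHA-256 are assumptions for illustration; a production router would typically live in a gateway or service mesh.

```python
import hashlib


def route(request_key: str, new_stack_percent: int) -> str:
    """Return "new" or "old" for a key, stable across calls.

    request_key: any stable identifier (tenant ID, session ID).
    new_stack_percent: share of traffic, 0 to 100, sent to the new stack.
    """
    if not 0 <= new_stack_percent <= 100:
        raise ValueError("new_stack_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "new" if bucket < new_stack_percent else "old"


if __name__ == "__main__":
    keys = [f"tenant-{i}" for i in range(1000)]
    for percent in (0, 5, 25, 50, 100):
        share = sum(route(k, percent) == "new" for k in keys) / len(keys)
        print(f"{percent:>3}% target -> {share:.1%} routed to new stack")
```

Because a key's bucket never changes, any key routed to the new stack at 5% stays there at 25% and beyond, which keeps a tenant's experience consistent during the ramp.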

iii

Weeks 12–14 · handover & first eval cycle

Handover & first eval cycle

Walkthroughs of every dashboard, runbook, and deployment script. We facilitate the first full eval cycle on your harness and the first model-swap drill. 30 days of post-handover Slack access while your team settles in. After this, your ML platform team owns the stack.

You’ll have

Recorded walkthrough sessions
1 verified model-swap drill
1 facilitated eval cycle
30-day Slack support window

Three Ways To Engage

From a migration sprint to fully managed inference.

All three start with the baseline audit. What differs is how much we stay around after the migration ships.

One-time

Migration sprint

$28k

fixed

10–14 weeks · one-time engagement

Audit, build, migrate, handover. Your team owns inference operations from week 15. Best for teams with strong ML engineers who just need the platform built right once.

Inference baseline + cost audit
vLLM / Triton deployed in your repo
GPU-aware K8s autoscaling
Fine-tuning pipeline + eval harness
30 days of post-handover support

Start the sprint

Ongoing

Embedded ML platform

$14k

/month

Month-to-month · cancel anytime

After the sprint, a senior ML platform engineer stays embedded part-time. They join your sprint planning, own model rollouts, tune the eval harness as your data grows. Best for teams shipping new models often.

Everything in the sprint
Senior ML engineer on Slack 8h/day
Monthly written eval & cost review
Model rollouts owned by us
Quarterly cost & latency tuning passes
Cancel any month, no notice required

Discuss embedded

Managed

Managed inference

$28k

/month

Quarterly contract · full ownership

We own the inference stack end-to-end. 24/7 paging on inference incidents, SLA-backed model availability, all model rollouts and eval cycles managed by us. For teams that need production AI ops but can’t justify hiring 3–5 ML platform engineers.

Everything in embedded
24/7 paging on inference incidents
99.9% model-availability SLA
All eval cycles & model swaps owned
Monthly executive AI ops report

Talk about managed

Recent Engagement

One client. From $87k/mo Bedrock to $23k/mo self-hosted.

A 12-week AI Infrastructure sprint at an AI-native Series B. Migration from Bedrock to self-hosted vLLM on a 6-GPU H100 cluster. p99 latency improved alongside the cost cut.

AI Infrastructure · vLLM · Triton · AWS H100 · 12 weeks delivered

From Bedrock dependency to self-hosted Llama-3 70B in 12 weeks

AI-native SaaS · ~60 engineers · Series B

“We were sleepwalking toward a $1.2M annual Bedrock spend. Cloudico’s team rebuilt our inference stack on vLLM in 12 weeks. We didn’t just save money — our p99 latency is half what it was. Customers noticed.”

David Reyes

Head of AI · AI-native SaaS, Series B

The team was running Llama-3 70B equivalents through Bedrock and burning $87k/month at scale. Our 12-week sprint deployed vLLM with PagedAttention on 6 H100 SXM nodes in their AWS account, added a Triton routing layer for mixed-model traffic, shipped a Promptfoo eval harness, and migrated traffic service-by-service over 4 weeks with zero customer-visible regressions. Their team now owns the entire stack.

Stack & tools shipped

vLLM · Triton · TensorRT-LLM · Llama-3 70B · Karpenter · EKS · H100 SXM · Promptfoo · MLflow · Prometheus · Grafana

−74%

Monthly inference spend reduction

156ms

Median TTFT (was 720ms)

89%

GPU utilization achieved (was 11%)

0

Eval regressions during cutover

Cost per 1M tokens · 12-week migration timeline

Same model class · same traffic shape · different infrastructure

Starting: $4.20/1M tokens (Bedrock)
Final: $1.08/1M tokens (tuned vLLM)
Total savings: $64k/month recovered
Annual: ~$770k/year
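
For readers who want to check the figures, the savings follow directly from the numbers quoted in this case study. A short Python sketch (variable names are illustrative):

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from before to after."""
    return (before - after) / before * 100


# Figures quoted in the case study above.
bedrock_monthly = 87_000      # USD per month on Bedrock
self_hosted_monthly = 23_000  # USD per month on self-hosted vLLM
cost_per_1m_before = 4.20     # USD per 1M tokens
cost_per_1m_after = 1.08      # USD per 1M tokens

monthly_savings = bedrock_monthly - self_hosted_monthly
annual_savings = monthly_savings * 12
spend_cut = percent_reduction(bedrock_monthly, self_hosted_monthly)
unit_cut = percent_reduction(cost_per_1m_before, cost_per_1m_after)

print(f"Monthly savings: ${monthly_savings:,}")
print(f"Annual savings:  ${annual_savings:,}")
print(f"Spend reduction: {spend_cut:.0f}%")
print(f"Unit cost cut:   {unit_cut:.0f}%")
```

The annual figure works out to $768,000, which the page rounds to ~$770k; both the monthly spend and the per-token cost drop by about 74%.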

Before The Call

The seven questions Heads of AI ask us.

Direct answers to the questions that come up before every AI Infrastructure engagement. Specific to ML platform work — different from the FAQs on the other three service pages.

Ask us directly

Won’t self-hosting be less reliable than Bedrock?

The fear is reasonable. The reality: vLLM and Triton are battle-tested in production at Stripe, Cohere, Meta, and Mistral. Deployed on K8s with autoscaling and monitoring, a self-hosted stack can reach 99.9%+ availability. We ship the reliability layer (monitoring, alerting, runbooks, on-call) alongside the inference stack, with the same discipline as our Reliability Engineering service. If you’ve been told self-hosting is hard, that was true two years ago. It’s not anymore.

What if we need Claude or GPT-4 specifically? We can’t self-host those.

Correct — frontier closed models stay on their APIs. We build a router layer that sends frontier-only workloads to Anthropic/OpenAI/Google APIs and self-hostable workloads to your vLLM cluster. Most clients find that 70–85% of their traffic can run on a self-hosted Llama-3, Mistral, or Qwen model — the remaining 15–30% stays on the frontier APIs for the hardest tasks. The cost math still works dramatically in your favor.

How do we swap models without breaking production?

This is exactly what the eval harness exists for. Before any model swap (Llama-3 to Llama-4, 70B to 7B-fine-tuned, etc.), the new candidate runs through your golden eval set and a shadow traffic stream. The harness blocks the rollout if regressions exceed your tolerance threshold. We deliver the harness with 100+ eval cases pulled from your real traffic. After this, swapping models is a normal engineering operation — not a Big Risky Decision.
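
As a rough illustration of the gate described above, the sketch below compares per-case scores for a baseline and a candidate model and blocks the rollout when too many cases regress. The function names, the score dictionaries, and both thresholds are assumptions for illustration; frameworks such as Promptfoo or DeepEval have their own configuration formats.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    per_case_tolerance: float = 0.05,  # assumed allowed score drop per case
) -> List[str]:
    """Return IDs of golden-set cases where the candidate scores notably worse."""
    regressed = []
    for case_id, base_score in baseline.items():
        # A case missing from the candidate run counts as a score of 0.
        cand_score = candidate.get(case_id, 0.0)
        if base_score - cand_score > per_case_tolerance:
            regressed.append(case_id)
    return regressed


def allow_rollout(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_regressed_fraction: float = 0.02,  # assumed tolerance threshold
    per_case_tolerance: float = 0.05,
) -> bool:
    """True if the share of regressed cases stays within the tolerance."""
    if not baseline:
        raise ValueError("golden eval set is empty")
    regressed = find_regressions(baseline, candidate, per_case_tolerance)
    return len(regressed) / len(baseline) <= max_regressed_fraction


if __name__ == "__main__":
    golden = {"refund-policy": 0.92, "sql-gen": 0.81, "summarize": 0.88}
    new_model = {"refund-policy": 0.93, "sql-gen": 0.55, "summarize": 0.87}
    print("regressed:", find_regressions(golden, new_model))
    print("rollout allowed:", allow_rollout(golden, new_model))
```

Wired into CI, a `False` result fails the pipeline, so a model swap goes through the same review path as any other change.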

We have HIPAA / SOC 2 requirements. Can self-hosted handle that?

Better than Bedrock in many cases. Self-hosted means your data never leaves your VPC, your audit logs are complete, and you control the encryption-at-rest configuration. We deploy in HIPAA-eligible AWS regions with encrypted EBS, KMS-managed keys, and CloudTrail evidence trails. For SOC 2, the same discipline applies. Several of our clients chose self-hosted specifically for compliance reasons — not in spite of them.

Do we need our own GPU cluster, or can we use spot capacity?

Both work, depending on traffic shape. Baseline traffic usually goes on reserved or on-demand H100 / A100 nodes for stable latency. Peak and training workloads can run on spot GPUs (40–70% cheaper) using AWS Spot, GCP Spot, CoreWeave, or RunPod. The Karpenter configuration we ship handles the placement decisions. For most clients, a hybrid setup beats either pure approach.

What about RAG, agents, and vector stores?

In scope when relevant. We ship pgvector or Qdrant for vector storage (Bedrock Knowledge Bases replacement), LangGraph or custom orchestration for agent workflows, and the same observability stack covers RAG pipelines and agent execution traces. For the agent layer specifically, we deliberately avoid Bedrock Agents as a primary because the lock-in cost is too high. Open patterns let you swap providers as the field evolves.

What does pricing actually include?

AI Infrastructure sprints start at $28k fixed-scope for 10–14 weeks, scope confirmed in writing after the audit. Typical range $34–72k depending on model count, traffic volume, and compliance scope. Embedded ML platform retainers from $14k/month, cancel anytime. Managed inference from $28k/month on a quarterly contract. If we miss our committed timeline you don’t pay for the overrun. All pricing is fixed — never variable, never hourly. GPU costs (your AWS/GCP bill) are separate and you keep them in your own account — we never resell compute.

Ready when you are

From an $87k Bedrock bill to an ML platform you own.

Book a 30-minute AI infra call. Senior ML platform engineer on the call. We’ll look at your current stack together and tell you whether self-hosting makes sense for your workload — honestly.

Book AI Infra Call · Compare all services

30-min consult · Mutual NDA available · Read-only access only · No obligation

Production LLM serving without the Bedrock bill.
