AI Infrastructure · 04 of 04 · New

Production LLM serving
without the Bedrock bill.

Self-hosted inference on vLLM and Triton. GPU scheduling that actually works on Kubernetes. Fine-tuning pipelines, eval harnesses, observability built in. Cut inference cost 60–75% while improving p99 latency — without giving up reliability or eval rigor.

Median inference cost reduction post-migration
Throughput uplift on the same GPU fleet
Median time-to-first-token after tuning
~/cloudico-ml-platform · vllm-serve
live
# start the serving engine on a tuned vLLM config
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 256 \
  --enable-chunked-prefill
[INFO] Loading model shards across 4 × H100 SXM (80GB)
[INFO] PagedAttention KV cache: enabled · fragmentation: 2.1%
[INFO] Continuous batching: enabled · max_num_seqs=256
[READY] Listening on 0.0.0.0:8000 · warmup complete in 42.6s
# first request from the router
[REQ] POST /v1/chat/completions · tenant="acme-prod"
[RES] 200 OK · ttft=186ms · throughput=94 tok/s · cost=$0.0011
GPU util 87%
p99 TTFT 312ms
Cost/1M tok $1.12
vs Bedrock −73%
AI infrastructure shipped for YC-backed AI startups, applied-ML teams, and Series A–C SaaS
How AI Infra Gets Built (Badly)

It starts with a notebook
that worked once.

Almost every AI infrastructure problem we audit started the same way: a researcher built a working POC, the team shipped it, and then production traffic showed up. The curve usually runs through three stages:

01 The POC works

The research win.

A Jupyter notebook with a wrapped Hugging Face model serves the first 100 customers. Everyone celebrates. The “infrastructure” is one EC2 instance, a Flask wrapper, and a senior ML engineer keeping it alive on weekends.

100 customers · 1 GPU · ~$1.2k/mo
Month 1–3
02 Real traffic

The latency wall.

Customer count crosses 1,000. p99 latency creeps past 3 seconds. Your team Band-Aids it with bigger GPUs — spend doubles. Then doubles again. The ML engineer is now full-time on infra. Half the team migrates to Bedrock to “make it someone else’s problem.”

1k+ customers · p99 > 3s · ~$12k/mo
Month 4–8
03 Bill shock

The vendor bill shock.

Monthly Bedrock or Vertex bill crosses $80k. CFO asks “do we even need this LLM?” Engineering says yes. Finance says prove it. Nobody has the eval data to defend the spend. The cost curve is now a board-level conversation. The team is in the worst position to make a calm decision.

Bedrock dependency · No eval data · $80k+/mo
Month 9+
What’s Broken

Five patterns we find in every AI stack we audit.

These aren’t speculative. They’re the failure modes we see in 9 of every 10 production AI infrastructures — including teams running on Bedrock, Vertex AI, and SageMaker.

01

Your GPUs run at 14% utilization.

A standard Hugging Face pipeline() call keeps your A100 or H100 mostly idle under real traffic. It processes requests one at a time while everything else queues. The fix isn’t buying bigger GPUs. It’s switching from naive serving to vLLM with continuous batching.

Typical waste
86%
of GPU capacity paid for but idle on the default serving stack
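To make the gap concrete, here is a toy scheduling model, not a benchmark. It counts decode steps for one-request-at-a-time serving versus continuous batching. It assumes a decode step takes the same wall time whatever the batch size and ignores prefill entirely; both are simplifications, and the function names and the `max_num_seqs=8` default are illustrative only.

```python
def sequential_steps(output_lens):
    """Naive serving: one request at a time, one token per decode step."""
    return sum(n for n in output_lens if n > 0)


def continuous_batching_steps(output_lens, max_num_seqs=8):
    """Iteration-level scheduling: every decode step advances all running
    sequences by one token, and a finished sequence frees its slot for the
    next queued request before the following step."""
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be at least 1")
    queue = [n for n in output_lens if n > 0]  # tokens left per request
    running = []
    steps = 0
    while queue or running:
        # Fill free slots from the queue.
        while queue and len(running) < max_num_seqs:
            running.append(queue.pop(0))
        # One decode step: each running sequence emits one token;
        # sequences that just emitted their last token leave the batch.
        running = [left - 1 for left in running if left > 1]
        steps += 1
    return steps


if __name__ == "__main__":
    lens = [100] * 16  # 16 requests, 100 output tokens each (made-up workload)
    print("sequential steps:", sequential_steps(lens))
    print("batched steps:   ", continuous_batching_steps(lens, max_num_seqs=8))
```

Under these assumptions the batched scheduler finishes the same work in far fewer steps, which is where the idle capacity goes in the naive setup. Real step time does grow with batch size, so measured gains will be smaller than the step count suggests.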
02

Bedrock bills compound quietly.

Cost per million tokens looks fine at low volume. At scale it eats your margin. Most teams discover this when the monthly bill crosses $80k and finance asks for a justification. By then, the obvious savings move (self-hosting) feels too risky to attempt mid-quarter.

Hidden cost
3–5×
what self-hosted vLLM would cost on the same model and traffic shape
03

p99 latency kills user trust.

The median request feels fast. The slow 1% feels broken. Naive serving doesn’t separate time-to-first-token from generation throughput, so optimization targets get mixed up. Result: streaming chat that pauses, agent workflows that time out, and customer churn the dashboard doesn’t predict.

Typical p99
~3.2s
vs ~480ms on a tuned vLLM stack with the same model
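A minimal sketch of the measurement behind that comparison, assuming you already log a request start time and a timestamp per streamed token. It separates time-to-first-token from inter-token latency and computes a nearest-rank percentile. The helper names and the sample data are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def split_latency(request_start, token_times):
    """Split one streamed response into (TTFT, mean inter-token latency),
    both in the same unit as the timestamps."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, inter_token


if __name__ == "__main__":
    # Made-up TTFT samples in seconds: most requests fast, a slow tail.
    ttfts = [0.2] * 98 + [3.2] * 2
    print("p50:", percentile(ttfts, 50), "p99:", percentile(ttfts, 99))
```

Tracking the two numbers separately matters because they respond to different knobs: TTFT is dominated by queueing and prefill, while inter-token latency reflects decode throughput.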
04

Vendor lock-in creeps in.

You build on Bedrock Knowledge Bases, then Bedrock Agents, then Bedrock Guardrails. Six months later, switching means rewriting your application layer. The cloud vendor’s pricing power grows every quarter you stay. Most teams don’t realize they’re locked in until they try to leave.

Switching cost
~6 mo
of engineering time to leave once you’re past the integration honeymoon
05

No eval harness — vibes-only.

When you change the prompt, the model, or the temperature, nobody can prove it got better. Your team merges based on “feels okay in my testing.” Regressions ship. Customer reports come in. AI without evals is software development without tests — you can ship like that for a while, then it catches up.

Ship confidence
low
every model swap or prompt change becomes a Big Risky Decision
Three Paths To Production AI

Bedrock, SageMaker, or own the stack.

Most teams default to the managed services because self-hosting feels scary. Here’s the honest comparison — including where managed actually beats self-hosted.

Fully managed, pay per token.

Bedrock and Vertex AI take infrastructure off your plate completely. The trade-off is that cost scales linearly (and then non-linearly) with usage, and the vendor’s product roadmap dictates what you can use. Great for POCs and low-volume production.

  • Zero infrastructure to manage — truly
  • Frontier models (Claude, Gemini) built-in
  • Fast to start — great for POCs
  • Cost compounds non-linearly with usage
  • Lock-in via Agents, Knowledge Bases, Guardrails
Real cost at scale
$4–6
per 1M tokens on Llama-3 70B-equivalent workloads. Premium frontier models like Claude Sonnet run higher.
Best for: POCs, low volume, frontier-only use cases

You pick the model. Cloud picks everything else.

SageMaker and Vertex endpoints let you bring your own model but still tie you to one cloud’s GPU pricing and a default serving container that rarely uses vLLM. Cheaper than Bedrock but the middle ground often gives you the worst of both options.

  • You pick the model and the GPU shape
  • Some autoscaling built-in (basic)
  • Cheaper than Bedrock at high volume
  • Default container rarely uses vLLM — low GPU util
  • Still locked to one cloud’s GPU pricing
Real cost at scale
$2–3
per 1M tokens on a 70B model. Better than Bedrock, but you’re paying a premium for a default serving stack that wastes most of the GPU.
Best for: teams not ready for self-hosted yet

Multi-cloud, engineer-tuned, no lock-in.

vLLM with PagedAttention + Triton routing on your Kubernetes. Runs on AWS, GCP, Azure, CoreWeave, or RunPod. Model-agnostic — swap Llama for Mistral or Qwen in hours, not quarters. This is what Stripe, Cohere, Meta, and Mistral run in production.

  • PagedAttention · 80%+ GPU utilization achievable
  • Continuous batching · 2–24x throughput uplift
  • Multi-cloud · AWS, GCP, Azure, CoreWeave, RunPod
  • Model-agnostic — swap models in hours, not quarters
  • Stripe migrated to this stack — cut their cost 73%
Real cost on tuned vLLM
$0.80–1.50
per 1M tokens on Llama-3 70B with tuned vLLM on H100 SXM. Hand-tuned configs hit the lower bound under steady load.
Best for: any team past POC scale — the default we recommend
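The per-token figures above come down to simple arithmetic: fleet cost per hour divided by tokens served per hour. The sketch below shows that calculation. The GPU price, GPU count, and throughput in the example are placeholder assumptions, not quotes or measurements from any engagement.

```python
def cost_per_million_tokens(gpu_hourly_usd, num_gpus, tokens_per_second,
                            utilization=1.0):
    """USD per 1M generated tokens.

    tokens_per_second is the fleet-wide throughput when fully loaded;
    utilization (0 < u <= 1) scales it down for time the GPUs sit idle.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    fleet_cost_per_hour = gpu_hourly_usd * num_gpus
    tokens_per_hour = tokens_per_second * utilization * 3600
    return fleet_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: 4 GPUs at $3.00/hour, 3,000 tokens/s when saturated.
    busy = cost_per_million_tokens(3.00, 4, 3000, utilization=0.87)
    idle = cost_per_million_tokens(3.00, 4, 3000, utilization=0.14)
    print(f"87% utilization: ${busy:.2f} per 1M tokens")
    print(f"14% utilization: ${idle:.2f} per 1M tokens")
```

The useful property is that cost per token is inversely proportional to utilization, so the same hardware bill can produce very different unit costs depending on how well the serving stack keeps the GPUs busy.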
What We Ship

Six capability blocks.
Production-grade ML infrastructure.

Every AI Infrastructure engagement covers the same six areas. Depth varies with scope; nothing on this list is optional in a production deployment.

The output

An ML platform your team actually owns.

We don’t sell you a tool subscription or wrap your model in our proprietary serving layer. We ship open-source infrastructure (vLLM, Triton, KServe, MLflow) in your repo, configured for your traffic shape and model mix. Your engineers run it after we leave.

vLLM · Triton · TensorRT-LLM · KServe · MLflow · Karpenter · Prometheus
73% cost cut avg

Optimized inference serving

vLLM with PagedAttention and continuous batching for chat and streaming. TensorRT-LLM compiled engines for max-throughput batch workloads. Triton in front for mixed-model routing. Tuned to your traffic shape, not generic defaults.

vLLM · TensorRT-LLM · Triton

GPU-aware Kubernetes

Karpenter or Cluster Autoscaler tuned for GPU node groups. NVIDIA device plugin, MIG partitioning where useful, spot GPU sourcing on AWS and GCP. Node-level scheduling that keeps utilization above 75% without over-provisioning.

Karpenter · NVIDIA plugin · MIG

Fine-tuning pipeline

LoRA and QLoRA pipelines on Hugging Face or Unsloth, hyperparameter tuning, base model selection. We ship the pipeline as IaC so your team can re-run it on new data without rebuilding the harness. Includes MLflow tracking for every run.

LoRA / QLoRA · Unsloth · MLflow

Eval harness

An eval pipeline using your golden datasets, LLM-as-judge where appropriate, and regression detection on every PR. Promptfoo or DeepEval as the framework. We deliver the harness AND the first 100 eval cases pulled from your real traffic.

Promptfoo · DeepEval · CI gates
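As an illustration of what "regression detection on every PR" can mean in practice, here is a framework-independent sketch of a CI gate. It is not Promptfoo's or DeepEval's API. It assumes each golden case already has a score between 0 and 1 for the baseline and the candidate, and the thresholds and names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    regressions: list       # case ids whose score dropped too far
    regression_rate: float  # share of golden cases that regressed


def regression_gate(baseline, candidate, min_drop=0.05,
                    max_regression_rate=0.02):
    """Compare per-case scores (dict: case_id -> score in [0, 1]).

    A case regresses when the candidate scores more than min_drop below
    the baseline. The gate fails when the share of regressed cases
    exceeds max_regression_rate.
    """
    missing = sorted(set(baseline) - set(candidate))
    if missing:
        raise ValueError(f"candidate is missing cases: {missing}")
    regressions = sorted(
        case_id for case_id, base_score in baseline.items()
        if base_score - candidate[case_id] > min_drop
    )
    rate = len(regressions) / len(baseline) if baseline else 0.0
    return GateResult(rate <= max_regression_rate, regressions, rate)


if __name__ == "__main__":
    baseline = {"refund-policy": 1.0, "sql-join": 0.8, "tone-check": 0.9}
    candidate = {"refund-policy": 1.0, "sql-join": 0.5, "tone-check": 0.9}
    result = regression_gate(baseline, candidate)
    print(result)
    raise SystemExit(0 if result.passed else 1)  # non-zero exit blocks the PR
```

Gating on the share of regressed cases rather than the average score keeps one large improvement from hiding several small breakages.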

ML observability

Per-request trace logging, TTFT and throughput histograms, GPU utilization by model and tenant, cost-per-1M-tokens dashboards. Built on Prometheus + Grafana — or your existing Datadog if that’s where the team lives.

Prometheus · OpenTelemetry · Per-tenant cost

Cost guardrails

Per-tenant token budgets, rate limits on expensive routes, automatic fall-through to cheaper models when budgets approach. Cost anomaly detection on every account. Monthly written cost review so the LLM line item never surprises your CFO again.

Per-tenant budgets · Model routing · Anomaly alerts
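A rough sketch of the budget-and-fall-through idea, assuming token usage is tracked per tenant in memory. A production version would persist counters and reset them each billing period. The class name, the 80% soft limit, and the model names are illustrative assumptions.

```python
class TenantBudget:
    """Tracks one tenant's token usage against a monthly budget."""

    def __init__(self, monthly_token_budget, soft_limit=0.8):
        if monthly_token_budget <= 0:
            raise ValueError("monthly_token_budget must be positive")
        if not 0 < soft_limit <= 1:
            raise ValueError("soft_limit must be in (0, 1]")
        self.monthly_token_budget = monthly_token_budget
        self.soft_limit = soft_limit
        self.used = 0

    def record(self, tokens):
        """Add tokens consumed by a completed request."""
        self.used += tokens

    def choose_model(self, primary, fallback):
        """Return the model to serve the next request with.

        Below the soft limit: the primary model.
        Between soft limit and budget: the cheaper fallback.
        At or over budget: None, meaning the caller should reject.
        """
        if self.used >= self.monthly_token_budget:
            return None
        if self.used >= self.soft_limit * self.monthly_token_budget:
            return fallback
        return primary


if __name__ == "__main__":
    budget = TenantBudget(monthly_token_budget=1_000_000)
    budget.record(850_000)
    # Hypothetical model names for the expensive and cheap routes.
    print(budget.choose_model("llama-3-70b", "llama-3-8b"))
```

Returning `None` instead of raising leaves the rejection policy, such as an HTTP 429 or a queued retry, to the calling service.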
The Optimization Stack

Each layer compounds the savings.

This is the actual cost-per-1M-token math we walk clients through. Same model (Llama-3 70B), same H100 fleet, different software stack and config. Costs are typical — your numbers will vary by workload shape.

Baseline: HF pipeline() · Default Hugging Face on bare GPU · Where most teams start · $4.20
+ INT8 quantization · VRAM footprint cut ~50% · Quick win, minimal accuracy loss · $3.05
+ vLLM PagedAttention · KV cache fragmentation eliminated · 2–24x throughput uplift · $1.98
+ Continuous batching · No idle GPU between requests · Saturates GPU under traffic · $1.39
+ Speculative decoding · Draft model accelerates target · Big latency win + cost cut · $1.18
Final: tuned vLLM stack · What we hand you · Production-tuned · $1.01
Cost per 1M output tokens · Llama-3 70B · H100 SXM
−76% total reduction · same hardware
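The compounding can be checked directly from the figures listed above. This short script computes each layer's saving relative to the previous layer and the running total relative to the baseline. Only the dollar amounts come from the list; the function name is ours.

```python
STACK = [
    ("Baseline: HF pipeline()", 4.20),
    ("+ INT8 quantization", 3.05),
    ("+ vLLM PagedAttention", 1.98),
    ("+ Continuous batching", 1.39),
    ("+ Speculative decoding", 1.18),
    ("Final: tuned vLLM stack", 1.01),
]


def reductions(stack):
    """Return (name, cost, step_pct, total_pct) per layer.

    step_pct is the saving versus the previous layer; total_pct is the
    saving versus the first entry. Both are rounded to one decimal.
    """
    base = stack[0][1]
    prev = base
    rows = []
    for name, cost in stack:
        step_pct = (prev - cost) / prev * 100
        total_pct = (base - cost) / base * 100
        rows.append((name, cost, round(step_pct, 1), round(total_pct, 1)))
        prev = cost
    return rows


if __name__ == "__main__":
    for name, cost, step_pct, total_pct in reductions(STACK):
        print(f"{name:<26} ${cost:.2f}  step -{step_pct}%  total -{total_pct}%")
```

No single layer removes more than about a third of the remaining cost; the 76% total is the product of the individual steps.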
What You Walk Away With

Numbers your Head of AI can defend to the board.

Averaged across our last 11 AI Infrastructure engagements. Your numbers will vary by workload shape, model choice, and traffic profile — but the direction is consistent.

Median inference cost reduction · vs prior managed-service spend
Throughput uplift on same GPU fleet · vLLM + tuned batching
Median TTFT after tuning · Down from 740ms pre-engagement
Average GPU utilization achieved · vs 14% on naive serving
How It Runs

Three phases. One named senior ML engineer from day one.

Each engagement runs 10–14 weeks depending on model count and traffic complexity. A senior ML platform engineer is named on day one and stays on your Slack and your calls the entire time.

i
Weeks 1–2 · baseline & audit

Inference baseline

Read-only audit of your current AI stack: models in use, hosting (Bedrock / Vertex / SageMaker / self-hosted), traffic patterns, p99 latency, GPU utilization, cost per million tokens. We benchmark your current setup so the “after” numbers have a defensible “before.”

You’ll have
  • Inference cost-per-token baseline
  • GPU utilization audit
  • Top 10 optimization opportunities
  • Locked scope & price for build phase
ii
Weeks 3–11 · build & migrate

Build the self-hosted stack

Deploy vLLM or Triton in your Kubernetes cluster. Wire up GPU autoscaling. Migrate traffic gradually using a router that splits between old and new. Run shadow traffic for eval-set validation. Ship the fine-tuning pipeline and the eval harness. Cut over service-by-service — never a flag day.

You’ll have
  • vLLM / Triton in production
  • GPU-aware K8s autoscaling
  • Fine-tuning pipeline & eval harness
  • Cost guardrails + observability
  • Migration completed service-by-service
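One way the gradual traffic split could look, as a sketch rather than the router shipped in an engagement: hash a stable request key into a bucket and send a configurable percentage to the new stack. The key format and the labels "old" and "new" are assumptions for illustration.

```python
import hashlib


def bucket(request_key: str) -> int:
    """Map a key (for example a tenant or session id) to a stable
    bucket in the range 0..9999."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def route(request_key: str, new_stack_percent: float) -> str:
    """Return "new" for a stable share of keys, "old" for the rest.

    Raising new_stack_percent only moves keys from old to new, so a
    tenant never flips back and forth during the rollout.
    """
    if not 0 <= new_stack_percent <= 100:
        raise ValueError("new_stack_percent must be between 0 and 100")
    return "new" if bucket(request_key) < new_stack_percent * 100 else "old"


if __name__ == "__main__":
    keys = [f"tenant-{i}" for i in range(10_000)]  # synthetic keys
    for pct in (5, 25, 50, 100):
        share = sum(route(k, pct) == "new" for k in keys) / len(keys)
        print(f"target {pct:>3}% -> observed {share:.1%}")
```

Hashing on a tenant or session key rather than picking randomly per request keeps each customer on one backend at a time, which makes regressions easier to attribute during cutover.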
iii
Weeks 12–14 · handover & first eval cycle

Handover & first eval cycle

Walkthroughs of every dashboard, runbook, and deployment script. We facilitate the first full eval cycle on your harness and the first model-swap drill. 30 days of post-handover Slack access while your team settles in. After this, your ML platform team owns the stack.

You’ll have
  • Recorded walkthrough sessions
  • 1 verified model-swap drill
  • 1 facilitated eval cycle
  • 30-day Slack support window
Three Ways To Engage

From a migration sprint to fully managed inference.

All three start with the baseline audit. What differs is how much we stay around after the migration ships.

One-time

Migration sprint

$28k
fixed
10–14 weeks · one-time engagement

Audit, build, migrate, handover. Your team owns inference operations from week 15. Best for teams with strong ML engineers who just need the platform built right once.

  • Inference baseline + cost audit
  • vLLM / Triton deployed in your repo
  • GPU-aware K8s autoscaling
  • Fine-tuning pipeline + eval harness
  • 30 days of post-handover support
Start the sprint
Managed

Managed inference

$28k
/month
Quarterly contract · full ownership

We own the inference stack end-to-end. 24/7 paging on inference incidents, SLA-backed model availability, all model rollouts and eval cycles managed by us. For teams that need production AI ops but can’t justify hiring 3–5 ML platform engineers.

  • Everything in embedded
  • 24/7 paging on inference incidents
  • 99.9% model-availability SLA
  • All eval cycles & model swaps owned
  • Monthly executive AI ops report
Talk about managed
Recent Engagement

One client. From $87k/mo Bedrock to $23k/mo self-hosted.

A 12-week AI Infrastructure sprint at an AI-native Series B. Migration from Bedrock to self-hosted vLLM on a 6-GPU H100 cluster. p99 latency improved alongside the cost cut.

AI Infrastructure · vLLM · Triton · AWS H100 · 12 weeks delivered
From Bedrock dependency to self-hosted Llama-3 70B in 12 weeks
AI-native SaaS · ~60 engineers · Series B

“We were sleepwalking toward a $1.2M annual Bedrock spend. Cloudico’s team rebuilt our inference stack on vLLM in 12 weeks. We didn’t just save money — our p99 latency is half what it was. Customers noticed.”

David Reyes
Head of AI · AI-native SaaS, Series B

The team was running Llama-3 70B equivalents through Bedrock and burning $87k/month at scale. Our 12-week sprint deployed vLLM with PagedAttention on 6 H100 SXM nodes in their AWS account, added a Triton routing layer for mixed-model traffic, shipped a Promptfoo eval harness, and migrated traffic service-by-service over 4 weeks with zero customer-visible regressions. Their team now owns the entire stack.

Stack & tools shipped
vLLM · Triton · TensorRT-LLM · Llama-3 70B · Karpenter · EKS · H100 SXM · Promptfoo · MLflow · Prometheus · Grafana
−74%
Monthly inference spend reduction
156ms
Median TTFT (was 720ms)
89%
GPU utilization achieved (was 11%)
0
Eval regressions during cutover
Chart: Cost per 1M tokens over the 12-week migration timeline (W0 to W12, plus 30 days), with milestones "audit done", "vLLM live", and "tuned". Same model class · same traffic shape · different infrastructure.
Starting: $4.20/1M tokens (Bedrock)
Final: $1.08/1M tokens (tuned vLLM)
Total savings: $64k/month recovered
Annual: ~$770k/year
Before The Call

The seven questions Heads of AI ask us.

Direct answers to the questions that come up before every AI Infrastructure engagement. Specific to ML platform work — different from the FAQs on the other three service pages.

Ask us directly
Won’t self-hosting be less reliable than Bedrock?
The fear is reasonable. The reality: vLLM and Triton are battle-tested in production at Stripe, Cohere, Meta, and Mistral. Deployed properly on Kubernetes with autoscaling and monitoring, a self-hosted stack can reach 99.9%+ availability. We ship the reliability layer (monitoring, alerting, runbooks, on-call) alongside the inference stack, with the same discipline as our Reliability Engineering service. If you’ve been told self-hosting is hard, that was true two years ago. It’s not anymore.
What if we need Claude or GPT-4 specifically? We can’t self-host those.
Correct — frontier closed models stay on their APIs. We build a router layer that sends frontier-only workloads to Anthropic/OpenAI/Google APIs and self-hostable workloads to your vLLM cluster. Most clients find that 70–85% of their traffic can run on a self-hosted Llama-3, Mistral, or Qwen model — the remaining 15–30% stays on the frontier APIs for the hardest tasks. The cost math still works dramatically in your favor.
How do we swap models without breaking production?
This is exactly what the eval harness exists for. Before any model swap (Llama-3 to Llama-4, 70B to 7B-fine-tuned, etc.), the new candidate runs through your golden eval set and a shadow traffic stream. The harness blocks the rollout if regressions exceed your tolerance threshold. We deliver the harness with 100+ eval cases pulled from your real traffic. After this, swapping models is a normal engineering operation — not a Big Risky Decision.
We have HIPAA / SOC 2 requirements. Can self-hosted handle that?
Better than Bedrock in many cases. Self-hosted means your data never leaves your VPC, your audit logs are complete, and you control the encryption-at-rest configuration. We deploy in HIPAA-eligible AWS regions with encrypted EBS, KMS-managed keys, and CloudTrail evidence trails. For SOC 2, the same discipline applies. Several of our clients chose self-hosted specifically for compliance reasons — not in spite of them.
Do we need our own GPU cluster, or can we use spot capacity?
Both work, depending on traffic shape. Baseline traffic usually goes on reserved or on-demand H100 / A100 nodes for stable latency. Peak traffic and training jobs can run on spot GPUs (40–70% cheaper) using AWS Spot, GCP Spot, CoreWeave, or RunPod. The Karpenter configuration we ship handles the placement decisions. For most clients, a hybrid setup beats either pure approach.
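A back-of-the-envelope sketch of why the hybrid setup tends to win, assuming peak capacity is only needed for a fraction of each hour on average and ignoring spot interruptions. All prices, GPU counts, and fractions in the example are placeholders, not vendor quotes.

```python
def hybrid_hourly_cost(baseline_gpus, peak_gpus, peak_fraction,
                       on_demand_usd, spot_discount):
    """Average hourly GPU cost with baseline on on-demand nodes and
    peak capacity on spot nodes.

    peak_fraction: share of time the peak GPUs are actually running (0..1).
    spot_discount: spot saving versus on-demand, e.g. 0.5 for 50% cheaper.
    """
    if not 0 <= peak_fraction <= 1:
        raise ValueError("peak_fraction must be in [0, 1]")
    if not 0 <= spot_discount < 1:
        raise ValueError("spot_discount must be in [0, 1)")
    baseline_cost = baseline_gpus * on_demand_usd
    spot_price = on_demand_usd * (1 - spot_discount)
    peak_cost = peak_gpus * peak_fraction * spot_price
    return baseline_cost + peak_cost


if __name__ == "__main__":
    # Hypothetical fleet: 4 baseline GPUs, 4 peak GPUs busy 25% of the time,
    # $3.00/hour on-demand, spot at 50% off.
    hybrid = hybrid_hourly_cost(4, 4, 0.25, 3.00, 0.5)
    always_on = 8 * 3.00  # provisioning on-demand for peak around the clock
    print(f"hybrid: ${hybrid:.2f}/h vs always-on on-demand: ${always_on:.2f}/h")
```

The model leaves out the cost of handling spot reclaims, which is why latency-sensitive baseline traffic stays on reserved or on-demand nodes.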
What about RAG, agents, and vector stores?
In scope when relevant. We ship pgvector or Qdrant for vector storage (Bedrock Knowledge Bases replacement), LangGraph or custom orchestration for agent workflows, and the same observability stack covers RAG pipelines and agent execution traces. For the agent layer specifically, we deliberately avoid Bedrock Agents as the primary orchestration layer because the lock-in cost is too high. Open patterns let you swap providers as the field evolves.
What does pricing actually include?
AI Infrastructure sprints start at $28k fixed-scope for 10–14 weeks, scope confirmed in writing after the audit. Typical range $34–72k depending on model count, traffic volume, and compliance scope. Embedded ML platform retainers from $14k/month, cancel anytime. Managed inference from $28k/month on a quarterly contract. If we miss our committed timeline you don’t pay for the overrun. All pricing is fixed — never variable, never hourly. GPU costs (your AWS/GCP bill) are separate and you keep them in your own account — we never resell compute.
Ready when you are

From an $87k Bedrock bill to an ML platform you own.

Book a 30-minute AI infra call. Senior ML platform engineer on the call. We’ll look at your current stack together and tell you whether self-hosting makes sense for your workload — honestly.

30-min consult · Mutual NDA available · Read-only access only · No obligation