Self-hosted inference on vLLM and Triton. GPU scheduling that actually works on Kubernetes. Fine-tuning pipelines, eval harnesses, observability built in. Cut inference cost 60–75% while improving p99 latency — without giving up reliability or eval rigor.
Almost every AI infrastructure problem we audit started the same way: a researcher built a working POC, the team shipped it, and then production traffic showed up. Here’s the curve, in three stages:
A Jupyter notebook with a wrapped Hugging Face model serves the first 100 customers. Everyone celebrates. The “infrastructure” is one EC2 instance, a Flask wrapper, and a senior ML engineer keeping it alive on weekends.
Customer count crosses 1,000. p99 latency creeps past 3 seconds. Your team Band-Aids it with bigger GPUs — spend doubles. Then doubles again. The ML engineer is now full-time on infra. Half the team migrates to Bedrock to “make it someone else’s problem.”
Monthly Bedrock or Vertex bill crosses $80k. CFO asks “do we even need this LLM?” Engineering says yes. Finance says prove it. Nobody has the eval data to defend the spend. The cost curve is now a board-level conversation. The team is in the worst position to make a calm decision.
These aren’t speculative. They’re the failure modes we see in 9 of every 10 production AI stacks, including teams running on Bedrock, Vertex AI, and SageMaker.
A standard Hugging Face pipeline() call keeps your A100 or H100 mostly idle under real traffic. It processes one request at a time while everything else queues. The fix isn’t buying bigger GPUs; it’s switching from naive serving to vLLM with continuous batching.
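To make the batching point concrete, here is a toy scheduler model in Python. It is a sketch under simplifying assumptions, not vLLM's actual scheduler: it counts decode steps only, ignores prefill and KV-cache memory, and the request lengths and batch size are invented for illustration.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue before every step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        active = [left - 1 for left in active if left > 1]
    return steps

# Hypothetical output lengths in tokens: two long answers among short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]

one_at_a_time = sum(lengths)
static = static_batching_steps(lengths, 4)
continuous = continuous_batching_steps(lengths, 4)
print(one_at_a_time, static, continuous)
```

With these assumed lengths, one-at-a-time serving needs 260 steps, static batching 200, and continuous refill 110, because short requests no longer wait for the longest request in their batch to finish.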
Cost per million tokens looks fine at low volume. At scale it eats your margin. Most teams discover this when the monthly bill crosses $80k and finance asks for a justification. By then, the obvious savings move (self-host) feels too risky to attempt mid-quarter.
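The arithmetic behind that moment can be sketched in a few lines. Every number below is a hypothetical placeholder, not a quote from any provider, and the model assumes a flat self-hosted cost, which in practice steps up as you add GPUs.

```python
def managed_monthly_cost(tokens_millions, price_per_million):
    """Pay-per-token bill: grows in proportion to volume."""
    return tokens_millions * price_per_million

def break_even_tokens_millions(self_hosted_monthly, price_per_million):
    """Monthly volume (in millions of tokens) where both options cost the same."""
    return self_hosted_monthly / price_per_million

# Hypothetical inputs for illustration only.
PRICE_PER_MILLION = 4.0         # blended managed price, USD per 1M tokens
SELF_HOSTED_MONTHLY = 40_000.0  # GPUs plus platform engineering, USD per month

break_even = break_even_tokens_millions(SELF_HOSTED_MONTHLY, PRICE_PER_MILLION)
bill_at_20b = managed_monthly_cost(20_000, PRICE_PER_MILLION)
print(break_even, bill_at_20b)
```

Under these assumptions the two options cost the same at 10 billion tokens a month, and at 20 billion the managed bill is $80,000 against a $40,000 self-hosted baseline.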
The median request feels fast. The slow 1% feels broken. Naive serving doesn’t separate time-to-first-token from generation throughput, so optimization targets get mixed up. Result: streaming chat that pauses, agent workflows that time out, and customer churn the dashboard doesn’t predict.
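Separating the two metrics takes very little code. The sketch below assumes you can record a timestamp per streamed token; the function names and sample values are illustrative, not part of any serving framework.

```python
import math

def ttft_and_itl(request_start_ms, token_times_ms):
    """Return (time to first token, mean inter-token latency), both in ms."""
    ttft = token_times_ms[0] - request_start_ms
    gaps = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

def percentile(values, pct):
    """Nearest-rank percentile for 0 < pct <= 100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# One hypothetical streamed response: request sent at t=0, four tokens received.
ttft, itl = ttft_and_itl(0, [200, 230, 260, 290])

# Hypothetical TTFT samples: 98 fast requests and 2 that sat in a queue.
ttft_samples = [180] * 98 + [2400] * 2
p50 = percentile(ttft_samples, 50)
p99 = percentile(ttft_samples, 99)
print(ttft, itl, p50, p99)
```

In this made-up sample the median TTFT is 180 ms while p99 is 2,400 ms, which is the gap a median-only dashboard hides. Tracking TTFT (queueing and prefill) separately from inter-token latency (decode speed) tells you which of the two to tune.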
You build on Bedrock Knowledge Bases, then Bedrock Agents, then Bedrock Guardrails. Six months later, switching means rewriting your application layer. The cloud vendor’s pricing power grows every quarter you stay. Most teams don’t realize they’re locked in until they try to leave.
When you change the prompt, the model, or the temperature, nobody can prove it got better. Your team merges based on “feels okay in my testing.” Regressions ship. Customer reports come in. AI without evals is software development without tests — you can ship like that for a while, then it catches up.
Most teams default to the managed services because self-hosting feels scary. Here’s the honest comparison — including where managed actually beats self-hosted.
Bedrock and Vertex AI take infrastructure off your plate completely. The trade-off is that cost scales linearly (and then non-linearly) with usage, and the vendor’s product roadmap dictates what you can use. Great for POCs and low-volume production.
SageMaker and Vertex endpoints let you bring your own model but still tie you to one cloud’s GPU pricing and a default serving container that rarely uses vLLM. Cheaper than Bedrock but the middle ground often gives you the worst of both options.
vLLM with PagedAttention + Triton routing on your Kubernetes. Runs on AWS, GCP, Azure, CoreWeave, or RunPod. Model-agnostic — swap Llama for Mistral or Qwen in hours, not quarters. This is what Stripe, Cohere, Meta, and Mistral run in production.
Every AI Infrastructure engagement covers the same six areas. Depth varies with scope; nothing on this list is optional in a production deployment.
We don’t sell you a tool subscription or wrap your model in our proprietary serving layer. We ship open-source infrastructure (vLLM, Triton, KServe, MLflow) in your repo, configured for your traffic shape and model mix. Your engineers run it after we leave.
vLLM with PagedAttention and continuous batching for chat and streaming. TensorRT-LLM compiled engines for max-throughput batch workloads. Triton on the front for mixed-model routing. Tuned per your traffic shape — not generic defaults.
Karpenter or Cluster Autoscaler tuned for GPU node groups. NVIDIA device plugin, MIG partitioning where useful, spot GPU sourcing on AWS and GCP. Node-level scheduling that keeps utilization above 75% without over-provisioning.
LoRA and QLoRA pipelines on Hugging Face or Unsloth, hyperparameter tuning, base model selection. We ship the pipeline as IaC so your team can re-run it on new data without rebuilding the harness. Includes MLflow tracking for every run.
An eval pipeline using your golden datasets, LLM-as-judge where appropriate, and regression detection on every PR. Promptfoo or DeepEval as the framework. We deliver the harness AND the first 100 eval cases pulled from your real traffic.
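The regression check itself can be very small. The following is a plain-Python stand-in for the idea, not the Promptfoo or DeepEval API; the function name, the 2-point threshold, and the case IDs are assumptions made for illustration.

```python
def eval_gate(baseline, candidate, max_drop=0.02):
    """Compare per-case pass/fail results for two versions of the system.

    Returns (gate_passed, regressed_case_ids). A case missing from the
    candidate run counts as a failure.
    """
    total = len(baseline)
    base_rate = sum(baseline.values()) / total
    cand_rate = sum(candidate.get(case, False) for case in baseline) / total
    regressed = sorted(
        case for case, ok in baseline.items()
        if ok and not candidate.get(case, False)
    )
    return (base_rate - cand_rate) <= max_drop, regressed

# Hypothetical results keyed by golden-dataset case ID.
baseline = {"c1": True, "c2": True, "c3": True, "c4": False, "c5": True}
prompt_tweak = {"c1": True, "c2": False, "c3": True, "c4": True, "c5": True}
model_swap = {"c1": True, "c2": False, "c3": False, "c4": False, "c5": True}

print(eval_gate(baseline, prompt_tweak))  # same pass rate, but c2 regressed
print(eval_gate(baseline, model_swap))    # pass rate drops, gate fails
```

The first comparison shows why the per-case list matters: the aggregate pass rate is unchanged, yet one previously passing case now fails, and a reviewer should see that before merging.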
Per-request trace logging, TTFT and throughput histograms, GPU utilization by model and tenant, cost-per-1M-tokens dashboards. Built on Prometheus + Grafana — or your existing Datadog if that’s where the team lives.
Per-tenant token budgets, rate limits on expensive routes, automatic fall-through to cheaper models when budgets approach. Cost anomaly detection on every account. Monthly written cost review so the LLM line item never surprises your CFO again.
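As a rough illustration of the fall-through logic, here is a minimal in-memory sketch. The class name, the 80% threshold, and the route labels are hypothetical; a real deployment would persist usage and reset it each billing period.

```python
class TenantBudget:
    """Tracks one tenant's monthly token budget and picks a route per request."""

    def __init__(self, monthly_limit, fallback_percent=80):
        self.monthly_limit = monthly_limit
        self.fallback_percent = fallback_percent
        self.used = 0

    def route(self, tokens):
        projected = self.used + tokens
        if projected > self.monthly_limit:
            return "reject"  # over budget: usage is not recorded
        self.used = projected
        if projected * 100 >= self.fallback_percent * self.monthly_limit:
            return "fallback"  # near the limit: send to the cheaper model
        return "primary"

budget = TenantBudget(monthly_limit=1000)
decisions = [budget.route(n) for n in (500, 300, 300, 200)]
print(decisions, budget.used)
```

Integer arithmetic keeps the threshold comparison exact. In this run the tenant gets the primary model first, falls through to the cheaper model at 80% of budget, has one oversized request rejected, and then fits a smaller one under the cap.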
This is the actual cost-per-1M-token math we walk clients through. Same model (Llama-3 70B), same H100 fleet, different software stack and config. Costs are typical — your numbers will vary by workload shape.
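The underlying formula is simple: fleet cost per hour divided by tokens generated per hour. The sketch below uses hypothetical inputs, not figures from any engagement, to show how throughput alone moves the unit cost on a fixed fleet.

```python
def cost_per_million_tokens(fleet_cost_per_hour, tokens_per_second):
    """USD per 1M generated tokens for a fleet at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return fleet_cost_per_hour * 1_000_000 / tokens_per_hour

# Hypothetical inputs: the same fleet at two sustained throughputs.
FLEET_COST_PER_HOUR = 36.0  # USD per hour for the whole GPU fleet
naive = cost_per_million_tokens(FLEET_COST_PER_HOUR, 1_000)
tuned = cost_per_million_tokens(FLEET_COST_PER_HOUR, 4_000)
print(naive, tuned)
```

With these assumed numbers, quadrupling sustained throughput on the same hardware drops the cost from $10.00 to $2.50 per million tokens. The hardware bill is identical in both cases; only utilization changed.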
Averaged across our last 11 AI Infrastructure engagements. Your numbers will vary by workload shape, model choice, and traffic profile — but the direction is consistent.
Each engagement runs 10–14 weeks depending on model count and traffic complexity. A senior ML platform engineer is named on day one and stays on your Slack and your calls the entire time.
Read-only audit of your current AI stack: models in use, hosting (Bedrock / Vertex / SageMaker / self-hosted), traffic patterns, p99 latency, GPU utilization, cost per million tokens. We benchmark your current setup so the “after” numbers have a defensible “before.”
Deploy vLLM or Triton in your Kubernetes cluster. Wire up GPU autoscaling. Migrate traffic gradually using a router that splits between old and new. Run shadow traffic for eval-set validation. Ship the fine-tuning pipeline and the eval harness. Cut over service-by-service — never a flag day.
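One common way to split traffic gradually is a sticky, hash-based router, sketched below. This is an illustrative assumption about how such a router could work, not a description of a specific product; the function names and the tenant key format are hypothetical.

```python
import hashlib

def bucket(key):
    """Map a key to a stable bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def choose_backend(key, new_percent):
    """Send roughly new_percent of keys to the new stack, the rest to the old one."""
    return "new" if bucket(key) < new_percent else "old"

tenants = [f"tenant-{i}" for i in range(10_000)]
share_at_10 = sum(choose_backend(t, 10) == "new" for t in tenants) / len(tenants)
print(round(share_at_10, 3))
```

Because the bucket depends only on the key, a tenant routed to the new stack at 10% stays there as the percentage rises, so nobody flips back and forth between stacks during the migration. Rolling back is a matter of lowering the percentage.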
Walkthroughs of every dashboard, runbook, and deployment script. We facilitate the first full eval cycle on your harness and the first model-swap drill. 30 days of post-handover Slack access while your team settles in. After this, your ML platform team owns the stack.
All three start with the baseline audit. What differs is how much we stay around after the migration ships.
Audit, build, migrate, handover. Your team owns inference operations from week 15. Best for teams with strong ML engineers who just need the platform built right once.
After the sprint, a senior ML platform engineer stays embedded part-time. They join your sprint planning, own model rollouts, tune the eval harness as your data grows. Best for teams shipping new models often.
We own the inference stack end-to-end. 24/7 paging on inference incidents, SLA-backed model availability, all model rollouts and eval cycles managed by us. For teams that need production AI ops but can’t justify hiring 3–5 ML platform engineers.
A 12-week AI Infrastructure sprint at an AI-native Series B. Migration from Bedrock to self-hosted vLLM on a 6-GPU H100 cluster. p99 latency improved alongside the cost cut.
“We were sleepwalking toward a $1.2M annual Bedrock spend. Cloudico’s team rebuilt our inference stack on vLLM in 12 weeks. We didn’t just save money — our p99 latency is half what it was. Customers noticed.”
The team was running Llama-3 70B equivalents through Bedrock and burning $87k/month at scale. Our 12-week sprint deployed vLLM with PagedAttention on 6 H100 SXM nodes in their AWS account, added a Triton routing layer for mixed-model traffic, shipped a Promptfoo eval harness, and migrated traffic service-by-service over 4 weeks with zero customer-visible regressions. Their team now owns the entire stack.
Direct answers to the questions that come up before every AI Infrastructure engagement. Specific to ML platform work — different from the FAQs on the other three service pages.
Ask us directly
Book a 30-minute AI infra call. Senior ML platform engineer on the call. We’ll look at your current stack together and tell you honestly whether self-hosting makes sense for your workload.