When teams ask us to audit an on-call rotation, the first thing they show us is the page-count graph. It’s the easiest metric to produce. It’s also the least useful one.

On-call ergonomics: three numbers beyond page volume
Production context from the Cloudico engineering notebook.

Time to acknowledgement

If the median ack time is above 5 minutes, the engineer probably isn’t sleeping near their phone, and the usual reason is that past noise has taught them not to trust the rotation. Fix the noise first; the ack time tends to recover on its own.
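To make the 5-minute threshold checkable, here is a minimal sketch of a median ack-time calculation. It assumes each page is available as a (paged_at, acked_at) pair of datetimes, with acked_at set to None for pages nobody acknowledged; the function name and the data shape are illustrative and not taken from any particular paging tool.

```python
from datetime import datetime, timedelta
from statistics import median


def median_ack_minutes(pages):
    """Return the median minutes from page to acknowledgement.

    `pages` is an iterable of (paged_at, acked_at) datetime pairs.
    Pages with acked_at=None are skipped (hypothetical convention for
    pages that were never acknowledged). Returns None if nothing was acked.
    """
    delays = [
        (acked_at - paged_at).total_seconds() / 60
        for paged_at, acked_at in pages
        if acked_at is not None
    ]
    return median(delays) if delays else None


# Illustrative data only: three acknowledged pages and one that was missed.
start = datetime(2024, 1, 1, 3, 0)
sample_pages = [
    (start, start + timedelta(minutes=1)),
    (start, start + timedelta(minutes=8)),
    (start, start + timedelta(minutes=3)),
    (start, None),
]
sample_median = median_ack_minutes(sample_pages)
```

Skipping unacknowledged pages keeps the median honest about response speed, but it hides missed pages entirely, so a count of those is worth tracking alongside it.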

Escalation depth

Count every page that gets escalated to a second person as 1.5 pages. If nearly every page on your team goes L1 → L2 → CTO, your real load approaches 50% above what the raw count suggests; at lower escalation rates the uplift is proportionally smaller.
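The 1.5x weighting can be sketched as a small function. The function and parameter names are illustrative; only the 1.5 weight comes from the rule above.

```python
def weighted_page_load(total_pages, escalated_pages, escalation_weight=1.5):
    """Return the page load with each escalated page counted as 1.5 pages.

    Non-escalated pages count as 1 page each. The names here are
    hypothetical; the 1.5 weight is the rule of thumb from the text.
    """
    if not 0 <= escalated_pages <= total_pages:
        raise ValueError("escalated_pages must be between 0 and total_pages")
    direct_pages = total_pages - escalated_pages
    return direct_pages + escalation_weight * escalated_pages


# Illustrative numbers: 40 pages in a week, 10 of them escalated.
partial = weighted_page_load(40, 10)   # 30 + 15 = 45.0
# Every page escalated: the load is 50% above the raw count.
worst = weighted_page_load(40, 40)     # 60.0
```

With 10 of 40 pages escalated, the weighted load is 45, which is 12.5% above the raw count. The full 50% uplift only appears when every page escalates.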

Post-incident recovery time

Most teams underweight this. After a 3am page, the engineer’s productive output for the next 36 hours is roughly half of normal. Track “on-call to retro” as a coverage cost.
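One rough way to put a number on that cost is sketched below. It assumes the 36-hour window contains about 12 working hours (one and a half 8-hour days) and that recovery windows from separate pages do not overlap. Both are simplifying assumptions made for this example, not figures from the text; only the "roughly half" output loss is.

```python
def recovery_cost_hours(night_pages, working_hours_in_window=12.0, output_loss=0.5):
    """Estimate engineer-hours of output lost to recovery after night pages.

    working_hours_in_window: assumed working hours inside the 36-hour
        recovery window (12.0 is an assumption for this sketch).
    output_loss: fraction of output lost during that window (0.5 reflects
        the "roughly half" estimate).
    Overlapping recovery windows are not modelled, so back-to-back night
    pages will be overcounted.
    """
    if night_pages < 0:
        raise ValueError("night_pages must be non-negative")
    return night_pages * working_hours_in_window * output_loss


# Illustrative: three night pages in one rotation.
lost_hours = recovery_cost_hours(3)  # 3 * 12 * 0.5 = 18.0
```

Reporting this figure next to page volume shows the coverage cost in the same units as planned engineering work.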

What good looks like

In a healthy rotation: median ack <2 min, escalation rate <15%, retros within 48 hours of every Sev-1. Get those three right and page volume stops being the limiting factor.
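The three thresholds can be turned into a simple check that a dashboard or weekly report could run. This is a sketch under assumed names: RotationStats and its fields are hypothetical, and the retro check uses the slowest Sev-1 retro in the period as a stand-in for "every Sev-1".

```python
from dataclasses import dataclass


@dataclass
class RotationStats:
    """Hypothetical summary of one rotation period."""
    median_ack_minutes: float
    escalation_rate: float                # fraction of pages escalated, 0.0 to 1.0
    slowest_sev1_retro_hours: float       # longest gap from a Sev-1 to its retro


def failing_checks(stats):
    """Return the names of the healthy-rotation thresholds that are not met."""
    failures = []
    if stats.median_ack_minutes >= 2:         # target: median ack < 2 min
        failures.append("median_ack")
    if stats.escalation_rate >= 0.15:         # target: escalation rate < 15%
        failures.append("escalation_rate")
    if stats.slowest_sev1_retro_hours > 48:   # target: retro within 48 hours
        failures.append("retro_delay")
    return failures


# Illustrative values: fast acks and timely retros, but too many escalations.
week = RotationStats(median_ack_minutes=1.5, escalation_rate=0.2,
                     slowest_sev1_retro_hours=30)
week_failures = failing_checks(week)  # ["escalation_rate"]
```

Returning the list of failed checks, and not just a pass/fail flag, makes it clear which of the three numbers needs attention.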

The operating test

We treat this as real only when it changes a dashboard, a runbook, and one named engineer’s weekly work. If the idea cannot survive those three places, it is probably just a slide.

The useful version is specific, measurable, and owned by someone who can say what changed after it shipped.

What we would do differently

  • Instrument before changing architecture. The baseline decides whether the fix worked.
  • Name the trade-off. Every improvement costs latency, money, complexity, or time somewhere else.
  • Revisit it after 30 days. Production has a way of teaching what the workshop missed.