When teams ask us to audit an on-call rotation, the first thing they show is the page-count graph. It’s the easiest metric. It’s also the least useful one.

Time to acknowledgement
If the median ack time is over 5 minutes, the engineer has probably stopped sleeping near their phone, because past noise taught them not to trust the rotation. Fix the noise first; the ack time recovers.
Escalation depth
Count every page that gets escalated to a second person as 1.5 pages. If most of your pages follow the L1 → L2 → CTO path, your real load is close to 50% higher than the raw count suggests.
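A minimal sketch of that weighting, assuming you can export pages with a flag for whether each one reached a second person. The `Page` record and its field names are hypothetical, not tied to any paging tool; only the 1.5 weight comes from the rule above.

```python
from dataclasses import dataclass


@dataclass
class Page:
    page_id: str      # hypothetical identifier from your paging export
    escalated: bool   # True if the page reached a second person


ESCALATED_WEIGHT = 1.5  # from the rule above: an escalated page counts as 1.5


def weighted_load(pages):
    """Sum pages, counting each escalated page as 1.5 instead of 1."""
    return sum(ESCALATED_WEIGHT if p.escalated else 1.0 for p in pages)


def load_overhead(pages):
    """Fraction by which weighted load exceeds the raw page count."""
    if not pages:
        return 0.0
    return weighted_load(pages) / len(pages) - 1.0


# Illustrative data only: four pages, two of them escalated.
sample_pages = [
    Page("p1", False),
    Page("p2", True),
    Page("p3", True),
    Page("p4", False),
]
print(len(sample_pages), weighted_load(sample_pages), load_overhead(sample_pages))
```

With half the pages escalated, the weighted load is 25% above the raw count; the overhead only reaches 50% when every page escalates.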
Post-incident recovery time
Most teams underweight this. After a 3am page, the engineer’s productive output for the next 36 hours is roughly half, so the page costs far more than the minutes spent resolving it. Track “on-call to retro” as a coverage cost.
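One way to put a number on that cost is sketched below. The 36-hour window and the half-output factor are the figures above; the number of working hours that fall inside the window is our assumption (12 by default), and the function treats each disrupted night as a separate, non-overlapping window, which will overcount for back-to-back nights.

```python
RECOVERY_WINDOW_HOURS = 36  # from the text: the window after a night page
OUTPUT_FACTOR = 0.5         # from the text: output is roughly half


def recovery_cost_hours(disrupted_nights, working_hours_in_window=12.0):
    """Estimate productive hours lost to post-page recovery.

    working_hours_in_window is an assumption: how many scheduled working
    hours land inside the 36-hour recovery window. It cannot exceed the
    window itself.
    """
    if not 0 <= working_hours_in_window <= RECOVERY_WINDOW_HOURS:
        raise ValueError("working hours must fit inside the recovery window")
    return disrupted_nights * working_hours_in_window * (1.0 - OUTPUT_FACTOR)


# Illustrative: three disrupted nights in one rotation week.
lost = recovery_cost_hours(3)
print(f"Estimated productive hours lost: {lost}")
```

Under those assumptions, three disrupted nights cost about 18 productive hours, more than two working days that never show up on the page-count graph.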
What good looks like
In a healthy rotation: median ack <2 min, escalation rate <15%, retros within 48 hours of every Sev-1. Get those three right and page volume stops being the limiting factor.
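As a rough sketch, the three thresholds can be checked from a paging export like this. The input shapes (ack times in seconds, one escalation flag per page, hours from each Sev-1 to its retro) are assumptions about what your tooling can provide; a missing retro is represented as `None`.

```python
from statistics import median

MAX_MEDIAN_ACK_MIN = 2.0    # from the text: median ack under 2 minutes
MAX_ESCALATION_RATE = 0.15  # from the text: escalation rate under 15%
MAX_RETRO_DELAY_HOURS = 48  # from the text: retro within 48 hours of a Sev-1


def rotation_health(ack_seconds, escalated_flags, sev1_retro_delay_hours):
    """Return each metric and whether it meets its threshold.

    Assumes ack_seconds and escalated_flags are non-empty and hold one
    entry per page. sev1_retro_delay_hours holds one entry per Sev-1,
    with None meaning no retro has happened yet.
    """
    median_ack_min = median(ack_seconds) / 60.0
    escalation_rate = sum(escalated_flags) / len(escalated_flags)
    retros_on_time = all(
        delay is not None and delay <= MAX_RETRO_DELAY_HOURS
        for delay in sev1_retro_delay_hours
    )
    return {
        "median_ack_min": median_ack_min,
        "ack_ok": median_ack_min < MAX_MEDIAN_ACK_MIN,
        "escalation_rate": escalation_rate,
        "escalation_ok": escalation_rate < MAX_ESCALATION_RATE,
        "retros_ok": retros_on_time,
    }


# Illustrative data only: ten pages, one escalation, two Sev-1 retros.
report = rotation_health(
    ack_seconds=[45, 60, 90, 110, 300, 30, 75, 80, 95, 120],
    escalated_flags=[False] * 9 + [True],
    sev1_retro_delay_hours=[20.0, 47.5],
)
print(report)
```

The median is used rather than the mean so that a single slow acknowledgement (the 300-second one here) does not mask an otherwise responsive rotation.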
The operating test
We treat this as real only when it changes a dashboard, a runbook, and one named engineer’s weekly work. If the idea cannot survive those three places, it is probably just a slide.
The useful version is specific, measurable, and owned by someone who can say what changed after it shipped.
What we would do differently
- Instrument before changing architecture. The baseline decides whether the fix worked.
- Name the trade-off. Every improvement costs latency, money, complexity, or time somewhere else.
- Revisit it after 30 days. Production has a way of teaching what the workshop missed.