On-call ergonomics: three numbers beyond page volume

When teams ask us to audit an on-call rotation, the first thing they show us is the page-count graph. It’s the easiest metric to collect. It’s also the least useful one.

Time to acknowledgement

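If the median ack time is over 5 minutes, the engineer probably isn’t keeping their phone within reach. The usual cause is not carelessness: past noise has taught them that most pages don’t matter, so they’ve stopped trusting the rotation. Fix the noise first; the ack time usually recovers.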
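A minimal sketch of how that median could be computed from raw page records. The (triggered, acknowledged) timestamp pairs and the function name are assumptions for illustration, not the export format of any particular paging tool.

```python
from datetime import datetime
from statistics import median


def median_ack_minutes(pages):
    """Median minutes between a page firing and its acknowledgement.

    `pages` is an iterable of (triggered_at, acked_at) datetime pairs.
    This record shape is hypothetical; adapt it to your paging export.
    """
    delays = [
        (acked_at - triggered_at).total_seconds() / 60
        for triggered_at, acked_at in pages
    ]
    return median(delays)


# Three made-up pages acknowledged after 1, 4, and 9 minutes.
sample_pages = [
    (datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 1)),
    (datetime(2024, 1, 2, 3, 0), datetime(2024, 1, 2, 3, 4)),
    (datetime(2024, 1, 3, 3, 0), datetime(2024, 1, 3, 3, 9)),
]

ack = median_ack_minutes(sample_pages)
print(f"median ack: {ack:.1f} min")  # median ack: 4.0 min
```

The median is used rather than the mean so that one page missed in the shower doesn’t dominate the number.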

Escalation depth

Count every page that gets escalated to a second person as 1.5 pages. If nearly every page on your team follows the L1 → L2 → CTO path, your real load approaches 50% above what the raw count suggests.
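The weighting is simple arithmetic, sketched here under the assumption that you can count escalated pages separately. The function and parameter names are ours for illustration; the 1.5 weight is the rule of thumb above.

```python
def weighted_page_load(total_pages, escalated_pages, escalation_weight=1.5):
    """Page load with each escalated page counted as `escalation_weight` pages."""
    if not 0 <= escalated_pages <= total_pages:
        raise ValueError("escalated_pages must be between 0 and total_pages")
    unescalated = total_pages - escalated_pages
    return unescalated + escalation_weight * escalated_pages


# 40 pages in a week, 10 of them escalated: 30 + 1.5 * 10
print(weighted_page_load(40, 10))  # 45.0

# Every page escalated: the load is 50% above the raw count
print(weighted_page_load(40, 40))  # 60.0
```

With a quarter of pages escalated the weighted load is 12.5% above the raw count; the full 50% only appears when every page escalates.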

Post-incident recovery time

Most teams underweight this. After a 3am page, the engineer’s productive output for the next 36 hours is roughly half. Track “on-call to retro”, the stretch from the page to the finished retro, as a coverage cost in its own right.
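A rough way to put a number on that cost, as a sketch only. The halved output comes from the estimate above; the 12 working hours assumed to fall inside each 36-hour window is our placeholder, and overlapping windows from back-to-back pages are ignored.

```python
def recovery_hours_lost(night_pages, working_hours_in_window=12.0, output_fraction=0.5):
    """Estimate productive hours lost to post-page recovery.

    working_hours_in_window: working hours that fall inside the 36-hour
        recovery window (assumed value, tune per team).
    output_fraction: share of normal output retained during recovery
        (0.5 matches the "roughly half" estimate).
    """
    return night_pages * working_hours_in_window * (1 - output_fraction)


# Four night pages in a rotation week
print(recovery_hours_lost(4))  # 24.0
```

Under those assumptions, four night pages cost about three working days of output, which is the figure to put next to the page count.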

What good looks like

In a healthy rotation: median ack <2 min, escalation rate <15%, retros within 48 hours of every Sev-1. Get those three right and page volume stops being the limiting factor.
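Those three thresholds are easy to turn into a check that a dashboard or weekly report could run. This is a sketch: the function name and input shapes are hypothetical, and only the thresholds come from the text above.

```python
def rotation_health(median_ack_minutes, total_pages, escalated_pages, sev1_retro_delays_hours):
    """Check a rotation against the three thresholds above.

    sev1_retro_delays_hours: hours between each Sev-1 and its retro.
    Returns a dict mapping each check to True (healthy) or False.
    """
    escalation_rate = escalated_pages / total_pages if total_pages else 0.0
    return {
        "median_ack_under_2_min": median_ack_minutes < 2,
        "escalation_rate_under_15_pct": escalation_rate < 0.15,
        "sev1_retros_within_48h": all(d <= 48 for d in sev1_retro_delays_hours),
    }


# Made-up numbers: 5 of 40 pages escalated (12.5%), two Sev-1 retros on time
healthy = rotation_health(1.5, 40, 5, [20, 47])
print(all(healthy.values()))  # True

# Made-up numbers: slow acks, 25% escalation, one retro three days late
struggling = rotation_health(6.0, 40, 10, [72])
print(struggling)
```

Returning one flag per check, rather than a single pass/fail, keeps it obvious which of the three numbers needs attention.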

The operating test

We treat an on-call improvement as real only when it changes a dashboard, a runbook, and one named engineer’s weekly work. If the idea cannot survive in those three places, it is probably just a slide.

The useful version is specific, measurable, and owned by someone who can say what changed after it shipped.

What we would do differently

Instrument before changing architecture. The baseline decides whether the fix worked.
Name the trade-off. Every improvement costs latency, money, complexity, or time somewhere else.
Revisit it after 30 days. Production has a way of teaching what the workshop missed.