AWS’s multi-region docs describe failover as a configuration exercise. In practice the configuration is the easy part. The hard part is everything that breaks when you actually flip the switch.

Multi-region failover is harder than the AWS docs say
Production context from the Cloudico engineering notebook.

1. DNS TTLs lie

You set a 60-second TTL. Resolvers respect it only some of the time. Corporate networks, certain ISPs, and aggressive caches will keep serving the old record for 5-15 minutes. Your RTO needs to account for that tail, not just the TTL.
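The arithmetic behind that RTO adjustment can be sketched roughly as follows. This is a minimal illustration, not a measured model: the `FailoverBudget` class, its field names, and the 90-second detection time are all assumptions made for the example. Only the 60-second TTL and the 15-minute upper bound come from the text above.

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    """Rough DNS-side RTO budget. All values are in seconds (hypothetical model)."""
    detection_s: float    # time for health checks to declare the primary down (assumed)
    dns_ttl_s: float      # the TTL configured on the record
    stale_cache_s: float  # how long the slowest resolvers keep serving the old record

    def well_behaved_rto(self) -> float:
        # Clients behind resolvers that honor the TTL.
        return self.detection_s + self.dns_ttl_s

    def worst_case_rto(self) -> float:
        # Clients behind resolvers that ignore the TTL wait for the longer
        # of the two windows.
        return self.detection_s + max(self.dns_ttl_s, self.stale_cache_s)


budget = FailoverBudget(detection_s=90, dns_ttl_s=60, stale_cache_s=15 * 60)
print(budget.well_behaved_rto())  # 150
print(budget.worst_case_rto())    # 990
```

Under these assumed numbers the gap between the two figures is 14 minutes, and it is the second figure that belongs in the RTO you promise.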

2. Lambda cold-start lag

The passive region’s Lambdas haven’t seen traffic, so every execution environment starts cold. For the first few minutes after failover, p99 latency spikes 10×. Provisioned concurrency in the passive region costs money. Without it, your first 5 minutes look broken.
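If provisioned concurrency is too expensive, one cheaper option is to pre-warm the passive region just before shifting traffic. The sketch below is an assumption about how that might look, not something the original runbook prescribes. `prewarm` and `fake_invoke` are hypothetical names. In real use, `invoke` would wrap a call to the Lambda function in the passive region, for example through the boto3 Lambda client.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def prewarm(invoke: Callable[[], None], concurrency: int) -> List[float]:
    """Fire `concurrency` invocations in parallel; return each latency in seconds.

    Parallelism matters here: sequential calls tend to reuse one warm
    execution environment, so they would warm a single instance, not many.
    """
    def timed_call(_: int) -> float:
        start = time.perf_counter()
        invoke()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(concurrency)))


def fake_invoke() -> None:
    # Stand-in for a real Lambda invocation (hypothetical).
    time.sleep(0.01)


latencies = prewarm(fake_invoke, concurrency=5)
print(f"warmed {len(latencies)} calls, slowest {max(latencies):.3f}s")
```

Parallel calls do not guarantee one environment per call, so treat the returned latencies as a signal: if the slowest call still looks like a cold start on a second run, the warm-up was not wide enough.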

3. Replication lag

Aurora cross-region replicas have seconds of lag in practice, not the milliseconds in the docs. If a write lands in the primary right before the cutover, a read in the secondary won’t see it for ~3-8 seconds.
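For a planned cutover, one way to avoid serving stale reads is to stop writes and hold the switch until the replica has caught up. The sketch below assumes that approach. `wait_for_replica_catchup`, its parameters, and the thresholds are hypothetical, and `get_lag_seconds` stands in for whatever lag source you trust, such as a replication-lag metric from your monitoring system.

```python
import time
from typing import Callable


def wait_for_replica_catchup(
    get_lag_seconds: Callable[[], float],
    max_lag_s: float = 1.0,
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once reported lag is at or below max_lag_s, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_lag_seconds() <= max_lag_s:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)


# Simulated lag readings that drain from 6 s to under 1 s (made-up values).
readings = iter([6.0, 3.5, 0.4])
caught_up = wait_for_replica_catchup(lambda: next(readings), sleep=lambda s: None)
print(caught_up)  # True
```

On a `False` result the operator has to choose between waiting longer and cutting over with known data loss. The gate only makes that choice explicit. It does not help in an unplanned failover, where the primary is already unreachable.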

4. Webhook timeouts

Stripe, Twilio, etc. fire webhooks at your primary URL. During failover, those requests return 5xx. The third party retries with exponential backoff, so deliveries arrive late or out of order, and you miss events once the retries run out. Plan for a replay queue for missed webhooks.
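A replay queue for missed webhooks could be sketched as follows. It relies on two assumptions that are not in the original notes: each event carries a provider-assigned unique ID and a creation timestamp, and missed events can be fetched again from somewhere, such as a provider's event listing or a log of request bodies captured at the edge. `WebhookEvent` and `ReplayQueue` are hypothetical names, and the in-memory set stands in for a durable store.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Set


@dataclass
class WebhookEvent:
    event_id: str      # provider-assigned unique ID (assumed to exist)
    created_at: float  # provider timestamp, seconds since epoch
    payload: dict


@dataclass
class ReplayQueue:
    handler: Callable[[WebhookEvent], None]
    seen_ids: Set[str] = field(default_factory=set)  # durable store in real use

    def process(self, event: WebhookEvent) -> bool:
        """Handle an event once; return False if it was already handled."""
        if event.event_id in self.seen_ids:
            return False
        self.handler(event)
        self.seen_ids.add(event.event_id)
        return True

    def replay(self, events: Iterable[WebhookEvent]) -> int:
        """Process events oldest first; return how many were new."""
        ordered = sorted(events, key=lambda e: e.created_at)
        return sum(self.process(e) for e in ordered)


handled = []
queue = ReplayQueue(handler=lambda e: handled.append(e.event_id))
queue.process(WebhookEvent("evt_1", 100.0, {}))  # delivered before the failover

# After failover: re-fetched events, unordered and including a duplicate.
missed = [
    WebhookEvent("evt_3", 300.0, {}),
    WebhookEvent("evt_1", 100.0, {}),
    WebhookEvent("evt_2", 200.0, {}),
]
new_count = queue.replay(missed)
print(new_count, handled)  # 2 ['evt_1', 'evt_2', 'evt_3']
```

Deduplicating by event ID is what makes the replay safe to run more than once, which matters when the provider's own retries land while the replay is still in progress.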

5. Your team’s muscle memory

The runbook says “flip the DNS.” When the time comes at 2am, your on-call engineer has never actually done it. Run a real failover quarterly, in production, during a scheduled window. The first one will fail. That’s the point.

The operating test

We treat this as real only when it changes a dashboard, a runbook, and one named engineer’s weekly work. If the idea cannot survive those three places, it is probably just a slide.

The useful version is specific, measurable, and owned by someone who can say what changed after it shipped.

What we would do differently

  • Instrument before changing architecture. The baseline decides whether the fix worked.
  • Name the trade-off. Every improvement costs latency, money, complexity, or time somewhere else.
  • Revisit it after 30 days. Production has a way of teaching what the workshop missed.