AWS Disaster Recovery — RTO and RPO Planning for Real Workloads

AWS disaster recovery planning is one of those things that every team says they’ve done and almost nobody has actually done well. I’ve spent years working with engineering teams on infrastructure architecture, and the pattern is painfully consistent — there’s a DR document somewhere, it was written during a compliance push, and nobody has opened it since. Then something breaks at 2 a.m. and the document is useless because it describes infrastructure that no longer exists. This guide is not a whitepaper. It’s the practical breakdown I wish I’d had earlier, with real numbers, real tradeoffs, and an honest look at what each strategy actually costs to run.

Before anything else, two terms you need to have locked down:

  • RTO (Recovery Time Objective) — how long you can be down before the business is genuinely damaged
  • RPO (Recovery Point Objective) — how much data loss you can tolerate, measured in time

These aren’t technical targets. They’re business decisions. Get them from your product owner or CFO, not from a gut feeling in a Terraform file.

The Four DR Strategies — Ranked by Cost and Speed

AWS officially documents four disaster recovery strategies in their Well-Architected Framework. The problem is those docs are 60 pages long and written for architects who already know the answer. Here’s the short version with the numbers that actually matter.

Strategy                   Typical RTO          Typical RPO             Monthly Cost Range
Backup and Restore         Hours to days        Hours (last snapshot)   $50–$200/mo
Pilot Light                30 min to 2 hours    Minutes to hours        $200–$500/mo
Warm Standby               5–30 minutes         Seconds to minutes      $800–$1,500/mo
Multi-Site Active-Active   Near-zero (<60 sec)  Near-zero               $2,000–$5,000+/mo

Backup and Restore

This is the most basic tier. You’re taking EBS snapshots, RDS automated backups, and exporting critical data to S3 — ideally with cross-region replication turned on. Recovery means spinning up new infrastructure from those snapshots, which takes time. If your RTO is measured in hours and your workload is non-critical, this is genuinely the right answer. Not a cop-out. A valid choice.

Pilot Light

The core database and identity systems are always running in the DR region, replicating continuously. Everything else — the application servers, the load balancers, the worker fleets — is off. When disaster strikes, you run a playbook (ideally automated with AWS Systems Manager or CloudFormation) that brings the rest of the stack online around the already-running data layer. RTO drops to 30–90 minutes if the playbook is tested. More on that last part later.

Warm Standby

The full application stack runs in the DR region, but at reduced capacity. Think one t3.medium instead of six m5.xlarge instances. Traffic isn’t hitting it during normal operations. When the primary region fails, you scale up and shift traffic. The infrastructure is already proven. You’re just changing the dial on capacity and DNS.

Multi-Site Active-Active

Both regions serve live traffic simultaneously. Route 53 distributes load between them using latency-based or weighted routing. Failure in one region is absorbed by the other with no manual intervention required. This is the only strategy where RTO is genuinely near-zero. It’s also the one that requires the most architectural discipline — especially around data consistency, which is where most teams underestimate the complexity.

Matching Strategy to Workload

Probably should have opened with this section, honestly. The strategy selection question isn’t “what’s the best DR approach” — it’s “what does this specific workload actually need.”

Customer-Facing Applications

If customers interact with your application directly — e-commerce, SaaS dashboards, patient portals, booking systems — downtime has immediate revenue and trust implications. Warm standby is the floor for most of these. Multi-site active-active is appropriate when the transaction volume or contractual SLA demands it. An RTO of 30 minutes is survivable for a lot of B2B SaaS products. An RTO of 30 minutes for a payment processor is a business-ending event.

Internal Tools and Back-Office Systems

HR platforms, internal reporting dashboards, ops tooling — these hurt when they’re down, but they don’t generate an immediate revenue impact. Pilot light is the right fit here. Keep the data layer alive in the secondary region, document the recovery playbook, and accept that recovery will take 45 minutes to two hours. The cost savings over warm standby are significant and the tradeoff is justified.

Data Analytics and Batch Workloads

An EMR cluster that runs nightly, an Athena-based reporting layer, a Redshift warehouse for monthly analytics — these can almost always tolerate hours of downtime. Backup and restore is appropriate. S3 cross-region replication on your data lake keeps your RPO manageable, and the compute layer can be rebuilt from infrastructure-as-code when needed.

The Decision Framework

Ask three questions in order:

  1. What is the hourly cost of this workload being unavailable? (Revenue loss, SLA penalties, staff idle time — put a number on it.)
  2. What is the maximum tolerable downtime before that cost becomes catastrophic?
  3. What is the monthly DR budget, and which strategy fits within it?

That third question is where real decisions get made. A warm standby setup that costs $1,200 a month is easy to justify for a workload where one hour of downtime costs $10,000. It’s hard to justify for an internal tool used by six people.
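The three questions above can be turned into a small decision helper. This is my own illustrative logic, not an AWS formula — the cost floors and RTO ceilings are rough assumptions lifted from the comparison table earlier, and you should replace them with your own numbers.

```python
# Hypothetical decision helper -- thresholds are illustrative assumptions
# based on the rough ranges in the strategy table, not AWS guidance.

STRATEGIES = [
    # (name, rough monthly cost floor in USD, rough RTO ceiling in minutes)
    ("Multi-Site Active-Active", 2000, 1),
    ("Warm Standby", 800, 30),
    ("Pilot Light", 200, 120),
    ("Backup and Restore", 50, 480),
]

def recommend_strategy(max_tolerable_downtime_min: float,
                       monthly_dr_budget: float) -> str:
    """Pick the cheapest strategy whose typical RTO fits the downtime
    tolerance and whose cost floor fits the budget."""
    # Walk from cheapest to most expensive; take the first that fits.
    for name, cost_floor, rto_ceiling_min in reversed(STRATEGIES):
        if rto_ceiling_min <= max_tolerable_downtime_min and cost_floor <= monthly_dr_budget:
            return name
    return "No strategy fits -- revisit the budget or the downtime tolerance"
```

Run it with your workload's real tolerance and budget; the point is that the answer falls out of two business numbers, not out of a preference for a particular architecture.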

Multi-Region Setup — What It Actually Takes

Burned by vague “just replicate your data” advice early in my career, I started documenting the specific AWS services required for each DR tier. Here’s what the infrastructure actually looks like.

Route 53 Health Checks

This is your traffic-routing brain. Configure health checks against your primary region endpoint — typically a simple HTTP check against an ALB or a synthetic check using CloudWatch Synthetics. Pair health checks with failover routing policies in Route 53. When the primary check fails, Route 53 automatically routes to your secondary endpoint. Set the health check interval to 10 seconds and the failure threshold to 3. That gives you a failover trigger in about 30 seconds.
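The 10-second interval and threshold of 3 translate into a config like the following. The field names follow the Route 53 CreateHealthCheck API; the domain and path are placeholders for your own endpoint.

```python
# Health-check parameters for the fast-failover settings described above.
# Field names match the Route 53 CreateHealthCheck API; endpoint values
# are placeholders.
health_check_config = {
    "Type": "HTTP",
    "FullyQualifiedDomainName": "app.example.com",  # primary ALB DNS name (placeholder)
    "ResourcePath": "/healthz",                     # assumed health endpoint
    "Port": 80,
    "RequestInterval": 10,   # seconds between checks (10 is the fast option)
    "FailureThreshold": 3,   # consecutive failures before "unhealthy"
}

# Worst-case time for Route 53 to mark the endpoint unhealthy.
detection_window_sec = (health_check_config["RequestInterval"]
                        * health_check_config["FailureThreshold"])

# Applying it would look roughly like (requires boto3 and credentials):
#   boto3.client("route53").create_health_check(
#       CallerReference="dr-primary-check-1",
#       HealthCheckConfig=health_check_config)
```

Note that DNS propagation and client-side TTLs add to the 30-second detection window, so keep record TTLs short (60 seconds or less) on anything behind a failover policy.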

RDS Cross-Region Read Replicas

For relational data, RDS supports cross-region read replicas for MySQL, MariaDB, and PostgreSQL engines. Replication lag on a well-tuned setup typically runs under 60 seconds for write volumes under 500 MB/hour. During a DR event, you promote the replica to a standalone instance — this breaks the replication link and makes the replica writable. Promotion takes 5–10 minutes. Aurora Global Database does this faster, with promotion in under 60 seconds and RPO measured in single-digit seconds.
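The promotion step can be wrapped in a small runbook function. This is a sketch with the client passed in so it can be exercised without credentials; the call and waiter names match the boto3 RDS client, and the replica identifier is a placeholder.

```python
# Sketch of the replica-promotion step, with the client injected so it
# can be tested without real AWS credentials.

def promote_dr_replica(rds_client, replica_id: str) -> str:
    """Promote a cross-region read replica to a standalone, writable
    instance. This breaks replication from the primary -- one-way trip."""
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Block until the promoted instance reports 'available' (typically 5-10 min).
    waiter = rds_client.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=replica_id)
    return replica_id

# In a real DR runbook:
#   promote_dr_replica(boto3.client("rds", region_name="us-west-2"),
#                      "myapp-dr-replica")
```

Because promotion is irreversible, the runbook should gate this call behind an explicit human confirmation unless you have very high confidence in your automated failure detection.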

S3 Cross-Region Replication

Enable CRR on every S3 bucket that holds application state or customer data. Replication lag for objects under 15 MB is typically under 15 minutes, often much faster. S3 Replication Time Control (S3 RTC) gives you a contractual 15-minute SLA if you need it — at an added cost of approximately $0.015 per GB replicated. For most workloads, standard CRR without RTC is sufficient.
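A CRR rule in the shape the S3 PutBucketReplication API expects looks roughly like this. Bucket names and the IAM role ARN are placeholders, and the commented-out lines show where RTC would go if you decide the SLA is worth the per-GB charge.

```python
# Replication configuration in the S3 PutBucketReplication shape.
# Bucket names and the IAM role ARN are placeholders.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # placeholder role
    "Rules": [{
        "ID": "dr-replicate-all",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {},  # empty filter = replicate every object
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {
            "Bucket": "arn:aws:s3:::myapp-dr-bucket",  # placeholder DR bucket
            # For the contractual 15-minute SLA, add S3 RTC (extra per-GB cost):
            # "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
            # "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
        },
    }],
}

# Applied with (requires boto3, credentials, and versioning on both buckets):
#   boto3.client("s3").put_bucket_replication(
#       Bucket="myapp-primary-bucket",
#       ReplicationConfiguration=replication_config)
```

Remember that CRR requires versioning enabled on both the source and destination buckets, and it only replicates objects written after the rule is created — existing objects need S3 Batch Replication or a one-time copy.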

DynamoDB Global Tables

If you’re running DynamoDB, Global Tables give you active-active multi-region replication with sub-second latency between regions. This is one of AWS’s genuinely impressive managed offerings. The tradeoff is cost — Global Tables replicate every write to every configured region, so write costs multiply by the number of regions. For a table with 50 million writes per month, adding a second region roughly doubles your DynamoDB write costs.
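The write-cost multiplication is worth putting a number on before you enable a second region. A back-of-envelope calculation, using an assumed illustrative unit price (check current DynamoDB pricing for your region and capacity mode):

```python
# Back-of-envelope: Global Tables replicate every write to every region,
# so write cost scales roughly linearly with region count. The unit price
# is an assumption for illustration -- check current DynamoDB pricing.
ASSUMED_PRICE_PER_MILLION_WRITES = 1.25  # USD, illustrative only

def monthly_write_cost(writes_per_month: int, regions: int) -> float:
    replicated_writes = writes_per_month * regions
    return replicated_writes / 1_000_000 * ASSUMED_PRICE_PER_MILLION_WRITES

single_region = monthly_write_cost(50_000_000, regions=1)  # 62.5
two_regions = monthly_write_cost(50_000_000, regions=2)    # 125.0 -- roughly double
```

Read traffic doesn't multiply the same way — reads are served locally in each region — so read-heavy tables get a much better deal out of Global Tables than write-heavy ones.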

Application Layer — EC2, ECS, and Lambda

For pilot light, your Auto Scaling Groups in the DR region should have a desired capacity of zero. Your launch templates need to be current. For warm standby, run the minimum viable cluster — two t3.medium instances behind an ALB is often enough to confirm the stack is healthy and ready to scale. Lambda functions are regional — deploy them in both regions as part of your standard CI/CD pipeline using separate stage configurations per region.
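The pilot-light activation step — taking the dormant ASG from zero to a real fleet — can be a one-function runbook step. This is a sketch with the client injected for testability; the group name and sizes are placeholders, and the call matches the boto3 Auto Scaling client.

```python
# Pilot-light activation sketch: turn the dormant DR Auto Scaling Group
# from zero instances into a working fleet. Group name and capacity are
# placeholders.

def activate_pilot_light(asg_client, group_name: str, desired: int) -> None:
    """Bring a zero-capacity DR Auto Scaling Group up to `desired` instances."""
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=desired,  # keep min == desired so it cannot drift back to zero
        DesiredCapacity=desired,
    )

# Runbook usage:
#   activate_pilot_light(
#       boto3.client("autoscaling", region_name="us-west-2"),
#       "myapp-dr-asg", 6)
```

Whatever form this takes in your stack, the critical dependency is the launch template: if the AMI or user data in the DR region is stale, scaling from zero just launches broken instances faster.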

Testing Your DR Plan — The Part Everyone Skips

I once reviewed a DR runbook that described a VPC CIDR block that had been changed eight months earlier. Nobody had tested it. The runbook was confidently wrong. Testing is not optional and not a once-a-year checkbox — it’s the only way to know your DR plan works.

Game Days

A game day is a structured failure simulation. Pick a date, notify stakeholders, define a scenario (“us-east-1 is unavailable”), and execute your recovery procedures as if it were real. Measure actual RTO and RPO, not theoretical ones. Document every gap between the runbook and reality. Schedule these quarterly for critical workloads, semi-annually for secondary ones.

Automated Failover Testing with Route 53

AWS Route 53 Application Recovery Controller lets you build readiness checks that validate whether your DR environment is actually prepared to receive traffic — checking things like resource capacity, replication lag, and configuration drift. You can trigger controlled failovers using routing controls without touching production DNS directly. This is the right way to test failover without causing an actual incident.

Chaos Engineering — Start Small

AWS Fault Injection Simulator (FIS) can inject controlled failures — terminate EC2 instances, throttle API calls, simulate AZ-level failures. Start with single-instance failures in your staging environment. Graduate to AZ-level simulations. Only run region-level chaos experiments when your runbooks and automation are proven. FIS experiments are configured as JSON templates and can be tied to CloudWatch alarms that automatically halt the experiment if real impact is detected.
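A minimal experiment template in the CreateExperimentTemplate shape might look like the following — terminate one tagged staging instance, and abort automatically if a CloudWatch alarm fires. The ARNs, tags, and alarm name are all placeholders.

```python
# Minimal FIS experiment template sketch: terminate one tagged staging
# instance, halting automatically if a CloudWatch alarm fires. ARNs and
# tags are placeholders.
fis_template = {
    "description": "Terminate one staging instance; abort on alarm",
    "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    "targets": {
        "one-staging-instance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"env": "staging"},
            "selectionMode": "COUNT(1)",  # pick a single matching instance
        }
    },
    "actions": {
        "terminate": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "one-staging-instance"},
        }
    },
    "stopConditions": [{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:user-impact",  # placeholder
    }],
}
```

The stop condition is the part teams skip and regret: without a real user-impact alarm wired in, a "controlled" experiment is only controlled until it isn't.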

Frequency Recommendations

  • Backup restore test — Monthly. Restore a snapshot to a test environment and verify data integrity.
  • Failover test (pilot light / warm standby) — Quarterly. Full runbook execution in a maintenance window.
  • Active-active validation — Monthly automated checks via ARC, quarterly manual game day.

Real Cost Examples

These numbers are based on a representative three-tier web application — an Application Load Balancer, two application servers, one RDS PostgreSQL Multi-AZ instance (db.m5.large), and an S3-backed static asset layer. All figures are estimates based on us-east-1 and us-west-2 pricing as of early 2025. Your actual numbers will vary based on data transfer volume and instance choices.

Backup and Restore — Approximately $50/month

RDS automated backups retained for 7 days, EBS snapshots copied to the secondary region, and S3 cross-region replication enabled on a 100 GB data bucket. No compute running in the secondary region. The $50 is almost entirely S3 storage and replication data transfer costs. RTO is 4–8 hours if you’ve never practiced recovery. Closer to 2 hours with a documented and tested runbook.

Pilot Light — Approximately $300/month

The RDS instance runs as a cross-region read replica in us-west-2 (db.t3.medium, approximately $110/month). Route 53 health checks are configured ($2.50/month per check). VPC, subnets, security groups, and launch templates are pre-configured in the DR region but no EC2 instances are running. The cost delta over backup/restore is primarily the read replica.

Warm Standby — Approximately $1,200/month

Two t3.medium application instances running continuously in us-west-2 behind a secondary ALB (~$130/month for instances, ~$20/month for the ALB). The RDS cross-region replica is upsized to a db.m5.large running Multi-AZ in the DR region (~$380/month), ready for promotion on failover. S3 replication, Route 53, and data transfer round out the rest. This is where the cost jump is most jarring for teams who haven’t planned for it — going from $300 to $1,200 is a real conversation to have with finance before you build it.

Multi-Site Active-Active — Approximately $2,400/month

Full production-equivalent infrastructure in both regions. Both ALBs serving live traffic via Route 53 latency-based routing. Aurora Global Database replacing the RDS setup (~$600/month across both regions for equivalent capacity). DynamoDB Global Tables if session state is stored there. ECS or EC2 capacity in both regions sized to handle full traffic independently during a regional failure. The $2,400 figure assumes 100% overhead — you’re essentially running two production environments. Some of that cost is offset if your traffic naturally distributes across regions and you were going to need the compute anyway.

The number architects actually need to bring to leadership is the ratio: what does one hour of downtime cost versus what does it cost per month to prevent it? For a workload generating $50,000 per hour in revenue, spending $2,400 a month on active-active DR is an obvious decision. For an internal analytics tool used by the finance team on Tuesdays, backup and restore at $50 a month is the right answer and defending anything more expensive would be wasteful.

Build the comparison. Show the math. Then let the business decide how much risk they’re willing to pay to avoid.

Marcus Chen
