AWS Disaster Recovery — RTO and RPO Planning for Real Workloads

AWS disaster recovery planning is one of those things that every team says they’ve done and almost nobody has actually done well. I’ve spent years working with engineering teams on infrastructure architecture, and the pattern is painfully consistent — there’s a DR document somewhere, it was written during a compliance push, and nobody has opened it since. Then something breaks at 2 a.m. and the document is useless because it describes infrastructure that no longer exists. This guide is not a whitepaper. It’s the practical breakdown I wish I’d had earlier, with real numbers, real tradeoffs, and an honest look at what each strategy actually costs to run.

Before anything else, two terms you need to have locked down:

  • RTO (Recovery Time Objective) — how long you can be down before the business is genuinely damaged
  • RPO (Recovery Point Objective) — how much data loss you can tolerate, measured in time

These aren’t technical targets. They’re business decisions. Get them from your product owner or CFO, not from a gut feeling in a Terraform file.

The Four DR Strategies — Ranked by Cost and Speed

AWS officially documents four disaster recovery strategies in their Well-Architected Framework. The problem is those docs are 60 pages long and written for architects who already know the answer. Here’s the short version with the numbers that actually matter.

Strategy                   Typical RTO          Typical RPO             Monthly Cost Range
Backup and Restore         Hours to days        Hours (last snapshot)   $50–$200/mo
Pilot Light                30 min to 2 hours    Minutes to hours        $200–$500/mo
Warm Standby               5–30 minutes         Seconds to minutes      $800–$1,500/mo
Multi-Site Active-Active   Near-zero (<60 sec)  Near-zero               $2,000–$5,000+/mo

Backup and Restore

This is the most basic tier. You’re taking EBS snapshots, RDS automated backups, and exporting critical data to S3 — ideally with cross-region replication turned on. Recovery means spinning up new infrastructure from those snapshots, which takes time. If your RTO is measured in hours and your workload is non-critical, this is genuinely the right answer. Not a cop-out. A valid choice.

Pilot Light

The core database and identity systems are always running in the DR region, replicating continuously. Everything else — the application servers, the load balancers, the worker fleets — is off. When disaster strikes, you run a playbook (ideally automated with AWS Systems Manager or CloudFormation) that brings the rest of the stack online around the already-running data layer. RTO drops to 30–90 minutes if the playbook is tested. More on that last part later.

Warm Standby

The full application stack runs in the DR region, but at reduced capacity. Think one t3.medium instead of six m5.xlarge instances. Traffic isn’t hitting it during normal operations. When the primary region fails, you scale up and shift traffic. The infrastructure is already proven. You’re just changing the dial on capacity and DNS.

Multi-Site Active-Active

Both regions serve live traffic simultaneously. Route 53 distributes load between them using latency-based or weighted routing. Failure in one region is absorbed by the other with no manual intervention required. This is the only strategy where RTO is genuinely near-zero. It’s also the one that requires the most architectural discipline — especially around data consistency, which is where most teams underestimate the complexity.

Matching Strategy to Workload

Probably should have opened with this section, honestly. The strategy selection question isn’t “what’s the best DR approach” — it’s “what does this specific workload actually need.”

Customer-Facing Applications

If customers interact with your application directly — e-commerce, SaaS dashboards, patient portals, booking systems — downtime has immediate revenue and trust implications. Warm standby is the floor for most of these. Multi-site active-active is appropriate when the transaction volume or contractual SLA demands it. An RTO of 30 minutes is survivable for a lot of B2B SaaS products. An RTO of 30 minutes for a payment processor is a business-ending event.

Internal Tools and Back-Office Systems

HR platforms, internal reporting dashboards, ops tooling — these hurt when they’re down, but they don’t generate an immediate revenue impact. Pilot light is the right fit here. Keep the data layer alive in the secondary region, document the recovery playbook, and accept that recovery will take 45 minutes to two hours. The cost savings over warm standby are significant and the tradeoff is justified.

Data Analytics and Batch Workloads

An EMR cluster that runs nightly, an Athena-based reporting layer, a Redshift warehouse for monthly analytics — these can almost always tolerate hours of downtime. Backup and restore is appropriate. S3 cross-region replication on your data lake keeps your RPO manageable, and the compute layer can be rebuilt from infrastructure-as-code when needed.

The Decision Framework

Ask three questions in order:

  1. What is the hourly cost of this workload being unavailable? (Revenue loss, SLA penalties, staff idle time — put a number on it.)
  2. What is the maximum tolerable downtime before that cost becomes catastrophic?
  3. What is the monthly DR budget, and which strategy fits within it?

That third question is where real decisions get made. A warm standby setup that costs $1,200 a month is easy to justify for a workload where one hour of downtime costs $10,000. It’s hard to justify for an internal tool used by six people.
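The three questions above can be turned into a small decision helper. This is my own illustrative logic, not an AWS formula — the cost floors and RTO ceilings are rough assumptions lifted from the comparison table earlier, and you should replace them with your own numbers.

```python
# Hypothetical decision helper -- thresholds are illustrative assumptions
# based on the rough ranges in the strategy table, not AWS guidance.

STRATEGIES = [
    # (name, rough monthly cost floor in USD, rough RTO ceiling in minutes)
    ("Multi-Site Active-Active", 2000, 1),
    ("Warm Standby", 800, 30),
    ("Pilot Light", 200, 120),
    ("Backup and Restore", 50, 480),
]

def recommend_strategy(max_tolerable_downtime_min: float,
                       monthly_dr_budget: float) -> str:
    """Pick the cheapest strategy whose typical RTO fits the downtime
    tolerance and whose cost floor fits the budget."""
    # Walk from cheapest to most expensive; take the first that fits.
    for name, cost_floor, rto_ceiling_min in reversed(STRATEGIES):
        if rto_ceiling_min <= max_tolerable_downtime_min and cost_floor <= monthly_dr_budget:
            return name
    return "No strategy fits -- revisit the budget or the downtime tolerance"
```

Run it with your workload's real tolerance and budget; the point is that the answer falls out of two business numbers, not out of a preference for a particular architecture.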

Multi-Region Setup — What It Actually Takes

Burned by vague “just replicate your data” advice early in my career, I started documenting the specific AWS services required for each DR tier. Here’s what the infrastructure actually looks like.

Route 53 Health Checks

This is your traffic-routing brain. Configure health checks against your primary region endpoint — typically a simple HTTP check against an ALB or a synthetic check using CloudWatch Synthetics. Pair health checks with failover routing policies in Route 53. When the primary check fails, Route 53 automatically routes to your secondary endpoint. Set the health check interval to 10 seconds and the failure threshold to 3. That gives you a failover trigger in about 30 seconds.
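The 10-second interval and threshold of 3 translate into a config like the following. The field names follow the Route 53 CreateHealthCheck API; the domain and path are placeholders for your own endpoint.

```python
# Health-check parameters for the fast-failover settings described above.
# Field names match the Route 53 CreateHealthCheck API; endpoint values
# are placeholders.
health_check_config = {
    "Type": "HTTP",
    "FullyQualifiedDomainName": "app.example.com",  # primary ALB DNS name (placeholder)
    "ResourcePath": "/healthz",                     # assumed health endpoint
    "Port": 80,
    "RequestInterval": 10,   # seconds between checks (10 is the fast option)
    "FailureThreshold": 3,   # consecutive failures before "unhealthy"
}

# Worst-case time for Route 53 to mark the endpoint unhealthy.
detection_window_sec = (health_check_config["RequestInterval"]
                        * health_check_config["FailureThreshold"])

# Applying it would look roughly like (requires boto3 and credentials):
#   boto3.client("route53").create_health_check(
#       CallerReference="dr-primary-check-1",
#       HealthCheckConfig=health_check_config)
```

Note that DNS propagation and client-side TTLs add to the 30-second detection window, so keep record TTLs short (60 seconds or less) on anything behind a failover policy.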

RDS Cross-Region Read Replicas

For relational data, RDS supports cross-region read replicas for MySQL, MariaDB, and PostgreSQL engines. Replication lag on a well-tuned setup typically runs under 60 seconds for write volumes under 500 MB/hour. During a DR event, you promote the replica to a standalone instance — this breaks the replication link and makes the replica writable. Promotion takes 5–10 minutes. Aurora Global Database does this faster, with promotion in under 60 seconds and RPO measured in single-digit seconds.
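The promotion step can be wrapped in a small runbook function. This is a sketch with the client passed in so it can be exercised without credentials; the call and waiter names match the boto3 RDS client, and the replica identifier is a placeholder.

```python
# Sketch of the replica-promotion step, with the client injected so it
# can be tested without real AWS credentials.

def promote_dr_replica(rds_client, replica_id: str) -> str:
    """Promote a cross-region read replica to a standalone, writable
    instance. This breaks replication from the primary -- one-way trip."""
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Block until the promoted instance reports 'available' (typically 5-10 min).
    waiter = rds_client.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=replica_id)
    return replica_id

# In a real DR runbook:
#   promote_dr_replica(boto3.client("rds", region_name="us-west-2"),
#                      "myapp-dr-replica")
```

Because promotion is irreversible, the runbook should gate this call behind an explicit human confirmation unless you have very high confidence in your automated failure detection.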

S3 Cross-Region Replication

Enable CRR on every S3 bucket that holds application state or customer data. Replication lag for objects under 15 MB is typically under 15 minutes, often much faster. S3 Replication Time Control (S3 RTC) gives you a contractual 15-minute SLA if you need it — at an added cost of approximately $0.015 per GB replicated. For most workloads, standard CRR without RTC is sufficient.
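A CRR rule in the shape the S3 PutBucketReplication API expects looks roughly like this. Bucket names and the IAM role ARN are placeholders, and the commented-out lines show where RTC would go if you decide the SLA is worth the per-GB charge.

```python
# Replication configuration in the S3 PutBucketReplication shape.
# Bucket names and the IAM role ARN are placeholders.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # placeholder role
    "Rules": [{
        "ID": "dr-replicate-all",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {},  # empty filter = replicate every object
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {
            "Bucket": "arn:aws:s3:::myapp-dr-bucket",  # placeholder DR bucket
            # For the contractual 15-minute SLA, add S3 RTC (extra per-GB cost):
            # "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
            # "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
        },
    }],
}

# Applied with (requires boto3, credentials, and versioning on both buckets):
#   boto3.client("s3").put_bucket_replication(
#       Bucket="myapp-primary-bucket",
#       ReplicationConfiguration=replication_config)
```

Remember that CRR requires versioning enabled on both the source and destination buckets, and it only replicates objects written after the rule is created — existing objects need S3 Batch Replication or a one-time copy.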

DynamoDB Global Tables

If you’re running DynamoDB, Global Tables give you active-active multi-region replication with sub-second latency between regions. This is one of AWS’s genuinely impressive managed offerings. The tradeoff is cost — Global Tables replicate every write to every configured region, so write costs multiply by the number of regions. For a table with 50 million writes per month, adding a second region roughly doubles your DynamoDB write costs.
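The write-cost multiplication is worth putting a number on before you enable a second region. A back-of-envelope calculation, using an assumed illustrative unit price (check current DynamoDB pricing for your region and capacity mode):

```python
# Back-of-envelope: Global Tables replicate every write to every region,
# so write cost scales roughly linearly with region count. The unit price
# is an assumption for illustration -- check current DynamoDB pricing.
ASSUMED_PRICE_PER_MILLION_WRITES = 1.25  # USD, illustrative only

def monthly_write_cost(writes_per_month: int, regions: int) -> float:
    replicated_writes = writes_per_month * regions
    return replicated_writes / 1_000_000 * ASSUMED_PRICE_PER_MILLION_WRITES

single_region = monthly_write_cost(50_000_000, regions=1)  # 62.5
two_regions = monthly_write_cost(50_000_000, regions=2)    # 125.0 -- roughly double
```

Read traffic doesn't multiply the same way — reads are served locally in each region — so read-heavy tables get a much better deal out of Global Tables than write-heavy ones.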

Application Layer — EC2, ECS, and Lambda

For pilot light, your Auto Scaling Groups in the DR region should have a desired capacity of zero. Your launch templates need to be current. For warm standby, run the minimum viable cluster — two t3.medium instances behind an ALB is often enough to confirm the stack is healthy and ready to scale. Lambda functions are regional — deploy them in both regions as part of your standard CI/CD pipeline using separate stage configurations per region.
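The pilot-light activation step — taking the dormant ASG from zero to a real fleet — can be a one-function runbook step. This is a sketch with the client injected for testability; the group name and sizes are placeholders, and the call matches the boto3 Auto Scaling client.

```python
# Pilot-light activation sketch: turn the dormant DR Auto Scaling Group
# from zero instances into a working fleet. Group name and capacity are
# placeholders.

def activate_pilot_light(asg_client, group_name: str, desired: int) -> None:
    """Bring a zero-capacity DR Auto Scaling Group up to `desired` instances."""
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=desired,  # keep min == desired so it cannot drift back to zero
        DesiredCapacity=desired,
    )

# Runbook usage:
#   activate_pilot_light(
#       boto3.client("autoscaling", region_name="us-west-2"),
#       "myapp-dr-asg", 6)
```

Whatever form this takes in your stack, the critical dependency is the launch template: if the AMI or user data in the DR region is stale, scaling from zero just launches broken instances faster.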

Testing Your DR Plan — The Part Everyone Skips

I once reviewed a DR runbook that described a VPC CIDR block that had been changed eight months earlier. Nobody had tested it. The runbook was confidently wrong. Testing is not optional and not a once-a-year checkbox — it’s the only way to know your DR plan works.

Game Days

A game day is a structured failure simulation. Pick a date, notify stakeholders, define a scenario (“us-east-1 is unavailable”), and execute your recovery procedures as if it were real. Measure actual RTO and RPO, not theoretical ones. Document every gap between the runbook and reality. Schedule these quarterly for critical workloads, semi-annually for secondary ones.

Automated Failover Testing with Route 53

AWS Route 53 Application Recovery Controller lets you build readiness checks that validate whether your DR environment is actually prepared to receive traffic — checking things like resource capacity, replication lag, and configuration drift. You can trigger controlled failovers using routing controls without touching production DNS directly. This is the right way to test failover without causing an actual incident.

Chaos Engineering — Start Small

AWS Fault Injection Simulator (FIS) can inject controlled failures — terminate EC2 instances, throttle API calls, simulate AZ-level failures. Start with single-instance failures in your staging environment. Graduate to AZ-level simulations. Only run region-level chaos experiments when your runbooks and automation are proven. FIS experiments are configured as JSON templates and can be tied to CloudWatch alarms that automatically halt the experiment if real impact is detected.
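A minimal experiment template in the CreateExperimentTemplate shape might look like the following — terminate one tagged staging instance, and abort automatically if a CloudWatch alarm fires. The ARNs, tags, and alarm name are all placeholders.

```python
# Minimal FIS experiment template sketch: terminate one tagged staging
# instance, halting automatically if a CloudWatch alarm fires. ARNs and
# tags are placeholders.
fis_template = {
    "description": "Terminate one staging instance; abort on alarm",
    "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    "targets": {
        "one-staging-instance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"env": "staging"},
            "selectionMode": "COUNT(1)",  # pick a single matching instance
        }
    },
    "actions": {
        "terminate": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "one-staging-instance"},
        }
    },
    "stopConditions": [{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:user-impact",  # placeholder
    }],
}
```

The stop condition is the part teams skip and regret: without a real user-impact alarm wired in, a "controlled" experiment is only controlled until it isn't.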

Frequency Recommendations

  • Backup restore test — Monthly. Restore a snapshot to a test environment and verify data integrity.
  • Failover test (pilot light / warm standby) — Quarterly. Full runbook execution in a maintenance window.
  • Active-active validation — Monthly automated checks via ARC, quarterly manual game day.

Real Cost Examples

These numbers are based on a representative three-tier web application — an Application Load Balancer, two application servers, one RDS PostgreSQL Multi-AZ instance (db.m5.large), and an S3-backed static asset layer. All figures are estimates based on us-east-1 and us-west-2 pricing as of early 2025. Your actual numbers will vary based on data transfer volume and instance choices.

Backup and Restore — Approximately $50/month

RDS automated backups retained for 7 days, EBS snapshots copied to the secondary region, and S3 cross-region replication enabled on a 100 GB data bucket. No compute running in the secondary region. The $50 is almost entirely S3 storage and replication data transfer costs. RTO is 4–8 hours if you’ve never practiced recovery. Closer to 2 hours with a documented and tested runbook.

Pilot Light — Approximately $300/month

The RDS instance runs as a cross-region read replica in us-west-2 (db.t3.medium, approximately $110/month). Route 53 health checks are configured ($2.50/month per check). VPC, subnets, security groups, and launch templates are pre-configured in the DR region but no EC2 instances are running. The cost delta over backup/restore is primarily the read replica.

Warm Standby — Approximately $1,200/month

Two t3.medium application instances running continuously in us-west-2 behind a secondary ALB (~$130/month for instances, ~$20/month for the ALB). The RDS cross-region replica is upsized to a db.m5.large running Multi-AZ in the DR region (~$380/month), ready for promotion on failover. S3 replication, Route 53, and data transfer round out the rest. This is where the cost jump is most jarring for teams who haven’t planned for it — going from $300 to $1,200 is a real conversation to have with finance before you build it.

Multi-Site Active-Active — Approximately $2,400/month

Full production-equivalent infrastructure in both regions. Both ALBs serving live traffic via Route 53 latency-based routing. Aurora Global Database replacing the RDS setup (~$600/month across both regions for equivalent capacity). DynamoDB Global Tables if session state is stored there. ECS or EC2 capacity in both regions sized to handle full traffic independently during a regional failure. The $2,400 figure assumes 100% overhead — you’re essentially running two production environments. Some of that cost is offset if your traffic naturally distributes across regions and you were going to need the compute anyway.

The number architects actually need to bring to leadership is the ratio: what does one hour of downtime cost versus what does it cost per month to prevent it? For a workload generating $50,000 per hour in revenue, spending $2,400 a month on active-active DR is an obvious decision. For an internal analytics tool used by the finance team on Tuesdays, backup and restore at $50 a month is the right answer and defending anything more expensive would be wasteful.

Build the comparison. Show the math. Then let the business decide how much risk they’re willing to pay to avoid.

Marcus Chen
