Disaster Recovery and Business Continuity on AWS
This content is from the lesson "2.2.2 Disaster Recovery and Business Continuity" in our comprehensive course.
Disaster Recovery and Business Continuity are critical components for ensuring that applications remain operational during and after catastrophic events.
This blog covers disaster recovery strategies, RTO/RPO concepts, immutable infrastructure, load balancing, proxies, service quotas, and workload visibility.
____
How It Works & Core Attributes:

Disaster Recovery Strategies:
DR Fundamentals:
- What Disaster Recovery is: Disaster recovery (DR) is the process of restoring IT infrastructure and systems after a disaster. DR strategies ensure that critical business functions can continue or be quickly restored
- Recovery Point Objective (RPO): The maximum acceptable amount of data loss, measured in time. RPO determines how frequently you need to back up or replicate your data. For example, an RPO of one hour means backups must run at least hourly
- Recovery Time Objective (RTO): The maximum acceptable time between the start of an outage and the restoration of service. RTO determines how quickly you need to recover
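To make the two objectives concrete, here is a minimal Python sketch that checks a backup plan against them. It assumes the worst case for data loss is one full backup interval (the disaster strikes just before the next backup runs). The function names and the sample numbers are illustrative, not taken from any AWS API.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Assumption: a disaster just before the next backup loses one full interval."""
    return backup_interval


def meets_objectives(backup_interval: timedelta,
                     estimated_restore_time: timedelta,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare a backup plan against the business RPO and RTO."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": estimated_restore_time <= rto,
    }


# Hypothetical plan: nightly backups, restore measured at about 3 hours,
# while the business asks for a 4-hour RPO and a 4-hour RTO.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    estimated_restore_time=timedelta(hours=3),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=4),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

In this example the restore is fast enough, but nightly backups cannot satisfy a 4-hour RPO, so the backup frequency (or the replication method) would have to change.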
DR Strategy Types:

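- Backup and Restore: The simplest DR strategy, where you regularly back up data and restore it into newly provisioned infrastructure when needed. This approach has the longest RTO and RPO but is the most cost-effective
- Pilot Light: Only the core components of your application, typically the data layer with live replication, run continuously in the recovery Region. Application servers stay switched off or undeployed. When disaster strikes, you start and scale those servers around the pilot light to reach full capacity
- Warm Standby: A scaled-down but fully functional copy of your application runs continuously in the recovery Region. Because it can already serve traffic and only needs to scale up, it provides faster recovery than pilot light but costs more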
__
Advanced DR Strategies:
High-Availability DR:
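- Hot Standby: A fully functional, full-capacity copy of your application runs in a second location but serves no traffic until failover. This provides very fast recovery because nothing needs to be started or scaled, but it is among the most expensive options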
- Active-Active: Your application runs simultaneously in multiple locations. If one location fails, traffic automatically routes to the other locations
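The trade-off between recovery speed and cost across these strategies can be sketched as a small selection function. The RTO figures and cost ranks below are rough, illustrative assumptions chosen only to preserve the ordering described above. They are not AWS-published numbers, and real values depend on the workload.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto: timedelta   # illustrative assumption, not a guarantee
    cost_rank: int           # 1 = cheapest, 5 = most expensive


# Ordered from cheapest/slowest to most expensive/fastest.
STRATEGIES = [
    Strategy("Backup and Restore", timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=1), 2),
    Strategy("Warm Standby", timedelta(minutes=15), 3),
    Strategy("Hot Standby", timedelta(minutes=5), 4),
    Strategy("Active-Active", timedelta(minutes=1), 5),
]


def cheapest_strategy(target_rto: timedelta) -> str:
    """Pick the cheapest strategy whose assumed RTO fits the target."""
    candidates = [s for s in STRATEGIES if s.typical_rto <= target_rto]
    if not candidates:
        # Nothing is fast enough on paper: fall back to the fastest option.
        return STRATEGIES[-1].name
    return min(candidates, key=lambda s: s.cost_rank).name


print(cheapest_strategy(timedelta(hours=48)))    # Backup and Restore
print(cheapest_strategy(timedelta(minutes=30)))  # Warm Standby
print(cheapest_strategy(timedelta(seconds=10)))  # Active-Active
```

The point of the sketch is the decision rule: start from the business RTO, then choose the least expensive strategy that still satisfies it.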
DR Implementation:
- Multi-Region Deployment: Deploying your application across multiple AWS regions to ensure availability even if an entire region fails
- Cross-Region Replication: Automatically copying data to a different region for disaster recovery purposes
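As one possible illustration of cross-Region replication, the sketch below builds a replication configuration for Amazon S3. S3 is chosen here as an assumption, since the text does not name a service. The bucket names, account ID, and IAM role ARN are hypothetical placeholders. S3 replication also requires versioning to be enabled on both the source and destination buckets.

```python
def build_replication_config(role_arn: str, destination_bucket_arn: str,
                             rule_id: str = "dr-replication") -> dict:
    """Build a dict in the shape expected by S3's PutBucketReplication call."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix = replicate all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Hypothetical identifiers, for illustration only.
config = build_replication_config(
    role_arn="arn:aws:iam::123456789012:role/example-replication-role",
    destination_bucket_arn="arn:aws:s3:::example-dr-bucket-us-west-2",
)

# With boto3 installed and credentials configured, the config could be applied
# to a (hypothetical) source bucket roughly like this:
#   import boto3
#   boto3.client("s3").put_bucket_replication(
#       Bucket="example-primary-bucket-us-east-1",
#       ReplicationConfiguration=config,
#   )
print(config["Rules"][0]["Destination"]["Bucket"])
```

Keeping the configuration in a plain function makes it easy to review and version alongside the rest of the infrastructure code.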
__
Immutable Infrastructure:
Immutable Concepts:
- What Immutable Infrastructure is: Immutable infrastructure treats servers and other infrastructure components as disposable. Instead of updating existing components, you replace them with new ones
- Infrastructure as Code: Defining your infrastructure using code (e.g., CloudFormation, Terraform). This ensures consistency and enables automated deployment
- Golden Images: Pre-configured server images that contain all necessary software and configurations. New instances are created from these images
Immutable Benefits:
- Consistency: Every deployment uses the same base image, reducing configuration drift and deployment issues
- Rollback Capability: If a deployment fails, you can quickly roll back to the previous version by deploying the previous image
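- Security: Patches are applied by building a new image and replacing instances rather than by modifying running systems. This limits the need for administrative access to live servers and ensures no instance keeps running an outdated or tampered configuration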
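The deploy-and-rollback behaviour described above can be modelled in a few lines of Python. This is a simplified sketch: the `Fleet` class and the image IDs are hypothetical, and in a real setup the "replace instances" step would be carried out by your deployment tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fleet:
    """Hypothetical fleet whose instances are always launched from one image."""
    history: List[str] = field(default_factory=list)  # image IDs, oldest first

    @property
    def current_image(self) -> Optional[str]:
        return self.history[-1] if self.history else None

    def deploy(self, image_id: str) -> str:
        # Running servers are never modified: a new image is recorded and
        # every instance would be replaced with one launched from it.
        self.history.append(image_id)
        return image_id

    def rollback(self) -> str:
        # Rolling back is just another deployment, of the previous image.
        if len(self.history) < 2:
            raise RuntimeError("no previous image to roll back to")
        self.history.pop()
        return self.history[-1]


fleet = Fleet()
fleet.deploy("image-v1")
fleet.deploy("image-v2")        # suppose this release misbehaves
restored = fleet.rollback()
print(restored)                 # image-v1
```

Because every version is a complete image, rolling back does not require undoing individual changes on live servers.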
__
Load Balancing for High Availability:

Load Balancer Types:
- Application Load Balancer (ALB): Operates at the application layer (Layer 7) and routes HTTP/HTTPS traffic based on request content. ALB supports path-based routing and host-based routing
- Network Load Balancer (NLB): Operates at the transport layer (Layer 4) and handles very high request volumes at low latency. NLB is ideal for TCP, UDP, and TLS traffic
- Classic Load Balancer (CLB): The previous-generation load balancer that operates at both the application and transport layers. Use ALB or NLB for new workloads
Load Balancer Features:
- Health Checks: Load balancers regularly check the health of targets and only route traffic to healthy instances
- Auto Scaling Integration: Auto Scaling groups add or remove instances based on demand and automatically register or deregister them with the load balancer, so new capacity starts receiving traffic once it passes health checks
- SSL Termination: Load balancers can handle SSL/TLS termination, reducing the load on your application servers
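The interaction between health checks and routing can be sketched as follows. This is a toy model, not how Elastic Load Balancing is implemented: the class names, the round-robin policy, and the threshold of two consecutive check results are assumptions made for illustration.

```python
class Target:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record_check(self, passed: bool, threshold: int = 2) -> None:
        """Flip the health state only after `threshold` consecutive contrary results."""
        if passed == self.healthy:
            self._streak = 0
            return
        self._streak += 1
        if self._streak >= threshold:
            self.healthy = passed
            self._streak = 0


class LoadBalancer:
    def __init__(self, targets):
        self.targets = targets
        self._next = 0

    def route(self) -> str:
        """Round-robin over the targets that are currently healthy."""
        healthy = [t for t in self.targets if t.healthy]
        if not healthy:
            raise RuntimeError("no healthy targets")
        target = healthy[self._next % len(healthy)]
        self._next += 1
        return target.name


a, b, c = Target("a"), Target("b"), Target("c")
lb = LoadBalancer([a, b, c])

# Target b fails two health checks in a row and is taken out of rotation.
b.record_check(False)
b.record_check(False)

routed = [lb.route() for _ in range(4)]
print(routed)  # ['a', 'c', 'a', 'c']
```

Requiring several consecutive results before changing state keeps a single slow response from removing an otherwise healthy target.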
__
Proxy Concepts and Services:
Proxy Fundamentals:
- What Proxies are: Proxies act as intermediaries between clients and servers. They can provide load balancing, caching, security, and monitoring capabilities
- RDS Proxy: A fully managed database proxy that helps you manage and scale database connections. RDS Proxy can improve application availability and reduce database load
Proxy Benefits:
- Connection Pooling: Proxies can pool database connections, reducing the number of connections your database needs to handle
- Failover Management: RDS Proxy can automatically connect to a standby database instance during a failover while preserving application connections, reducing application downtime
- Security: Proxies can provide an additional layer of security by controlling access to your databases
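Connection pooling, the first benefit above, can be illustrated with a small in-process pool. This is a sketch of the general idea only. RDS Proxy is a managed service and does this on your behalf, and the `fake_connect` function below is a stand-in for a real database driver.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Keeps a fixed number of connections open and lends them to callers."""

    def __init__(self, connect, size: int):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout: float = 5.0):
        conn = self._idle.get(timeout=timeout)  # waits if all are in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection for reuse


opened = []  # every connection the "database" has had to accept


def fake_connect():
    # Stand-in for a real driver call such as opening a TCP session.
    conn = object()
    opened.append(conn)
    return conn


pool = ConnectionPool(fake_connect, size=2)

# Ten requests are served, but the database only ever sees two connections.
for _ in range(10):
    with pool.connection() as conn:
        pass  # run a query here

print(len(opened))  # 2
```

The database sees a small, stable number of connections regardless of how many requests the application handles.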
__
Service Quotas and Throttling:
Quota Management:
- What Service Quotas are: AWS service quotas limit the number of resources you can create or the rate at which you can make API calls. Most quotas apply per account and per Region, and many can be increased on request. Understanding these limits is crucial for designing highly available systems
- Quota Types: Different types of quotas include API rate limits, resource limits, and account limits. Each service has its own set of quotas
Throttling Strategies:
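- Exponential Backoff: A retry strategy where the delay between attempts grows exponentially (for example 1s, 2s, 4s) when you encounter throttling, usually capped at a maximum and randomized with jitter so that clients do not retry in lockstep
- Circuit Breaker Pattern: A pattern that prevents cascading failures by temporarily stopping requests to a dependency after repeated failures or throttling, then allowing a trial request once a cool-down period has passed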
- Quota Monitoring: Monitoring your quota usage to ensure you don't hit limits during critical operations
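The first two throttling strategies can be sketched in plain Python. This is a minimal illustration under stated assumptions: `ThrottledError` and `CircuitOpenError` are hypothetical exception types standing in for whatever your client library raises, the backoff uses the "full jitter" variant, and the thresholds are arbitrary.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error raised when the service rejects a call as throttled."""


class CircuitOpenError(Exception):
    """Raised by the breaker when it refuses to send a request."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_backoff(operation, max_attempts: int = 5, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of retries
            sleep(backoff_delay(attempt))


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open: request not sent")
            # Cool-down elapsed: fall through and allow one trial request.
        try:
            result = operation()
        except ThrottledError:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None
        return result


# Backoff demo: the operation is throttled twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottledError("rate exceeded")
    return "ok"

result = call_with_backoff(flaky, sleep=lambda seconds: None)  # skip real sleeping
print(result, calls["n"])  # ok 3

# Circuit breaker demo: after three throttled calls, further calls are blocked.
def always_throttled():
    raise ThrottledError("rate exceeded")

breaker = CircuitBreaker(failure_threshold=3, reset_after=30.0)
outcomes = []
for _ in range(5):
    try:
        breaker.call(always_throttled)
    except ThrottledError:
        outcomes.append("throttled")
    except CircuitOpenError:
        outcomes.append("blocked")
print(outcomes)  # ['throttled', 'throttled', 'throttled', 'blocked', 'blocked']
```

Backoff spaces out retries from a single caller, while the circuit breaker stops sending requests altogether for a while, which gives a struggling dependency room to recover.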
__
Workload Visibility and Monitoring:
Visibility Tools:
- What Workload Visibility is: Workload visibility involves monitoring and understanding how your applications are performing and behaving. This includes monitoring metrics, logs, and traces
- AWS X-Ray: A service that helps you analyze and debug distributed applications. X-Ray provides a visual representation of your application's architecture and performance
Monitoring Strategies:
- Metrics Collection: Collecting performance metrics from your applications and infrastructure to understand system behavior
- Log Analysis: Analyzing application and system logs to identify issues and understand application behavior
- Distributed Tracing: Tracking requests as they flow through your distributed system to identify bottlenecks and failures
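To show what distributed tracing makes possible, the sketch below groups spans by trace ID and reports the slowest service in each request. It is a simplified model, not the AWS X-Ray API: the `Span` fields are hypothetical, and each span is assumed to record only the time spent inside that service, excluding downstream calls.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str      # shared by every span that belongs to one request
    service: str
    self_time_ms: int  # assumption: time in this service, excluding downstream calls


def bottleneck_per_trace(spans) -> dict:
    """Return, for each trace, the service that consumed the most time."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)
    return {
        trace_id: max(group, key=lambda s: s.self_time_ms).service
        for trace_id, group in by_trace.items()
    }


# Hypothetical spans collected from two requests.
spans = [
    Span("t1", "api", 12),
    Span("t1", "orders", 40),
    Span("t1", "database", 310),
    Span("t2", "api", 10),
    Span("t2", "orders", 35),
]

bottlenecks = bottleneck_per_trace(spans)
print(bottlenecks)  # {'t1': 'database', 't2': 'orders'}
```

The shared trace ID is what lets measurements from separate services be stitched back into a single request.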
____
Analogy: A Hospital Emergency Response System
Imagine you're managing a hospital's emergency response system that needs to handle critical situations reliably.
Disaster Recovery Strategies: Your emergency backup systems that ensure critical medical services continue even during power outages or equipment failures. Each strategy has different recovery times and costs.
Immutable Infrastructure: Your standardized medical equipment and procedures that are replaced rather than repaired. This ensures consistency and reduces the risk of configuration errors.
Load Balancing: Your emergency room triage system that distributes patients to available doctors and treatment rooms. The system automatically routes patients to the most appropriate resources.
Proxy Concepts: Your medical coordinators who manage patient flow and ensure efficient resource utilization. They act as intermediaries between patients and medical staff.
Service Quotas: Your hospital's capacity limits and scheduling systems that prevent overcrowding and ensure quality care. The system manages patient flow to stay within operational limits.
Workload Visibility: Your hospital monitoring systems that track patient flow, resource utilization, and response times. This helps identify bottlenecks and improve emergency response efficiency.
____
Common Applications:
- Financial Services: Banking and payment systems that need to maintain operations during disasters
- Healthcare Systems: Medical applications that must remain available for patient care
- E-commerce Platforms: Online stores that need to continue processing orders during outages
- Government Services: Critical government applications that must remain operational
____
Quick Note: The "Disaster Recovery Foundation"
- Define RTO and RPO based on business requirements and cost constraints
- Implement appropriate DR strategies based on your RTO and RPO requirements
- Use immutable infrastructure to ensure consistent and reliable deployments
- Test disaster recovery procedures regularly to ensure they work as expected
- Monitor service quotas and plan for capacity requirements in standby environments