DevOps · 8 min read · September 28, 2025

Disaster Recovery Planning for Software Systems

Build a disaster recovery plan that works — RPO and RTO definitions, backup strategies, failover testing, runbooks, and the mistakes teams make before the crisis.

James Ross Jr.

Strategic Systems Architect & Enterprise Software Developer

Every team thinks about disaster recovery after their first disaster. The database goes down, and it takes four hours to restore from a backup that nobody tested. Or the cloud region experiences an outage, and the application has no cross-region failover because nobody configured it. The disaster itself is bad. Discovering during the disaster that you have no recovery plan is worse.

Disaster recovery planning is not exciting work. It does not ship features. It does not generate revenue. But when something goes catastrophically wrong — and it will — the plan is the difference between a one-hour recovery and a days-long scramble that costs the business far more than the planning would have.

RPO and RTO: Define Your Requirements First

Two numbers define every disaster recovery plan:

Recovery Point Objective (RPO) — how much data loss is acceptable. An RPO of one hour means you can lose up to one hour of data. An RPO of zero means no data loss is acceptable.

Recovery Time Objective (RTO) — how long the system can be down. An RTO of four hours means the application must be operational within four hours of a failure.

These numbers come from the business, not from engineering. The engineering team determines what is technically possible and what it costs. The business decides what the requirements are based on the cost of downtime and data loss.

RPO        Backup Strategy Needed            Cost
─────────────────────────────────────────────────
24 hours   Daily backups                     Low
1 hour     Hourly backups + WAL archiving    Moderate
Minutes    Streaming replication             High
Zero       Synchronous replication           Very high

RTO        Recovery Strategy Needed          Cost
─────────────────────────────────────────────────
24 hours   Manual restore from backup        Low
4 hours    Warm standby + manual failover    Moderate
1 hour     Hot standby + automated failover  High
Minutes    Active-active multi-region        Very high

Most web applications can tolerate an RPO of 5-15 minutes and an RTO of 1-4 hours. Financial systems and healthcare applications often need RPO near zero and RTO under 15 minutes. Setting these targets before designing the recovery plan prevents both over-engineering (spending money on zero-RPO when an hour is acceptable) and under-engineering (discovering during a crisis that your daily backups lose too much data).
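The mapping in the tables above can be sketched as a simple lookup, useful as a sanity check when drafting the plan. The thresholds and strategy names come straight from the tables; the function names are illustrative:

```python
def backup_strategy(rpo_hours: float) -> str:
    """Pick the backup tier implied by an RPO target (thresholds from the table)."""
    if rpo_hours == 0:
        return "Synchronous replication"
    if rpo_hours < 1:
        return "Streaming replication"
    if rpo_hours <= 1:
        return "Hourly backups + WAL archiving"
    return "Daily backups"

def recovery_strategy(rto_hours: float) -> str:
    """Pick the failover tier implied by an RTO target."""
    if rto_hours < 1:
        return "Active-active multi-region"
    if rto_hours <= 1:
        return "Hot standby + automated failover"
    if rto_hours <= 4:
        return "Warm standby + manual failover"
    return "Manual restore from backup"

print(backup_strategy(0.25))   # 15-minute RPO -> "Streaming replication"
print(recovery_strategy(4))    # 4-hour RTO   -> "Warm standby + manual failover"
```

Running both functions against your actual targets makes the cost conversation with the business concrete: each tighter tier is a visible step up in spend.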

Backup Strategy

Backups are the foundation of disaster recovery. But a backup is worthless if it has never been restored. I cannot emphasize this enough — untested backups are assumptions, not safeguards.

For PostgreSQL databases, a comprehensive backup strategy combines:

Continuous WAL archiving — the write-ahead log captures every change to the database. Archiving WAL segments to object storage enables point-in-time recovery to any moment within the retention window.

# PostgreSQL WAL archiving configuration
archive_mode = on
archive_command = 'aws s3 cp %p s3://backups/wal/%f'
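Recovery from those archived segments uses a matching restore_command on the recovering server. A sketch of a point-in-time recovery configuration for PostgreSQL 12+ (the bucket name and target time are placeholders):

```
# postgresql.conf on the recovering server
restore_command      = 'aws s3 cp s3://backups/wal/%f %p'
recovery_target_time = '2025-09-28 03:00:00 UTC'

# Then create an empty recovery.signal file in the data directory
# and start the server to begin WAL replay up to the target time.
```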

Regular base backups — full database dumps or physical copies taken daily or weekly. These provide the starting point for WAL replay during recovery.

Automated restore testing — a scheduled job that restores the latest backup to a test environment and verifies the data is consistent. Run this weekly at minimum. If the restore fails, you want to know now, not during an emergency.

Store backups in a different region and a different account than production. A disaster that takes out your production region should not also take out your backups. Cross-region replication of backup storage is inexpensive insurance.

For application state beyond the database — uploaded files, configuration, secrets — ensure these are backed up with the same rigor. Object storage (S3, R2) provides built-in redundancy, but verify that versioning is enabled so you can recover from accidental deletions. The database replication strategies article covers the real-time side of data protection that complements periodic backups.

Failover Architecture

The failover architecture determines your achievable RTO. Three common patterns:

Cold standby — infrastructure is defined in code but not running. Recovery means provisioning from scratch using your infrastructure as code templates. RTO: hours. Cost: very low (you pay nothing for idle infrastructure).

Warm standby — a smaller replica of your production environment runs continuously. The database replica stays in sync. Application instances are running but at reduced capacity. Recovery means scaling up the standby and redirecting traffic. RTO: 30-60 minutes. Cost: moderate (you pay for reduced-capacity infrastructure).

Hot standby / active-active — a full replica runs in another region, handling read traffic or a subset of write traffic. Recovery means redirecting all traffic to the surviving region. RTO: minutes. Cost: high (you pay for a full second environment).

# Terraform multi-region infrastructure
module "primary" {
  source = "./modules/app-stack"
  region = "us-east-1"
  role   = "primary"
}

module "standby" {
  source = "./modules/app-stack"
  region = "us-west-2"
  role   = "standby"

  db_replication_source = module.primary.db_endpoint
}
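Traffic redirection is the other half of failover. One common approach on AWS is Route 53 DNS failover backed by health checks; a sketch under the assumption that each module exposes an `lb_dns_name` output (zone, hostnames, and the health-check path are placeholders):

```hcl
resource "aws_route53_health_check" "primary" {
  fqdn              = "app.example.com"
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "primary"
  records         = [module.primary.lb_dns_name]
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "standby" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "standby"
  records        = [module.standby.lb_dns_name]

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

With this routing policy, Route 53 serves the primary record while its health check passes and fails over to the standby record when it does not; the low TTL keeps client caches from pinning traffic to the dead region.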

The failover trigger is as important as the failover architecture. Manual failover requires a human decision, which adds response time but prevents false-positive failovers. Automated failover responds faster but risks triggering on transient issues. For most applications, automated detection with manual confirmation is the right balance — the system alerts you and prepares the failover, but a human approves the switch.
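That "automated detection, manual confirmation" pattern is a small state machine: consecutive failed health checks arm the failover and page a human, but nothing switches until someone approves. A minimal sketch (class and method names are illustrative; a real system would persist state and integrate with paging):

```python
class FailoverCoordinator:
    """Arm failover after N consecutive failed health checks;
    switch traffic only after explicit human approval."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.awaiting_approval = False
        self.failed_over = False

    def record_health_check(self, healthy: bool) -> None:
        if healthy:
            self.consecutive_failures = 0  # transient blips reset the counter
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.awaiting_approval = True  # alert fires; failover is prepared

    def approve_failover(self) -> bool:
        """Human confirms; returns True if the switch actually happens."""
        if not self.awaiting_approval:
            return False
        self.failed_over = True
        return True

coordinator = FailoverCoordinator(failure_threshold=3)
for _ in range(3):
    coordinator.record_health_check(healthy=False)
print(coordinator.awaiting_approval)   # True: alert raised, waiting on a human
print(coordinator.approve_failover())  # True: traffic switch proceeds
```

The reset on a healthy check is what protects against false positives: a single transient timeout never arms the failover on its own.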

Runbooks and Testing

A disaster recovery plan without runbooks is a set of intentions. When the disaster happens — often at 2 AM, under pressure, with degraded communication — the responder needs step-by-step instructions, not architectural diagrams.

Each failure scenario needs its own runbook:

## Runbook: Primary Database Failure

### Detection
- Alert: DatabasePrimaryDown fires
- Verify: Cannot connect to primary database endpoint

### Recovery Steps
1. Confirm primary is truly down (not a network issue)
   - Check from multiple locations
   - Check cloud provider status page
2. Promote replica to primary
   - `pg_ctl promote -D /var/lib/postgresql/data`
   - Or: trigger automated failover via Patroni
3. Update application configuration
   - Point DATABASE_URL to new primary
   - Restart application pods
4. Verify application health
   - Check health endpoints
   - Verify recent data is present
5. Notify stakeholders
   - Post in #incidents channel
   - Update status page

### Post-Recovery
- Set up new replica from the promoted primary
- Investigate root cause of original failure
- Update this runbook if steps were inaccurate

Test the plan regularly. At minimum, quarterly. Chaos engineering — deliberately injecting failures in a controlled setting — validates that your recovery procedures work and that your team knows how to execute them. Netflix's Chaos Monkey approach (randomly terminating production instances) is one extreme. A more accessible approach is scheduling quarterly "game day" exercises where you simulate a specific failure scenario and execute the recovery runbook.

Every test should produce a retrospective. What worked? What was slower than expected? What step in the runbook was unclear? The runbook improves after every test, and the team's confidence in recovery grows with practice. The teams I have seen handle real disasters best are the ones that practiced recovery regularly — not because the technology was better, but because the humans executing the plan had done it before.

Disaster recovery planning is the ultimate example of work that feels unnecessary until it is the most important thing happening. Invest in it before you need it. The cost of not planning is measured in downtime hours, lost data, and customer trust that takes far longer to rebuild than any infrastructure.