Database Backup Strategies for Production: The Ones That Actually Work
A practical guide to production database backups — physical vs logical backups, point-in-time recovery, automated backup testing, retention policies, and restoring when it matters.

James Ross Jr.
Strategic Systems Architect & Enterprise Software Developer
Most developers think about backups after they need them. By then it is too late to find out that the backup process was silently failing, that the backup files were corrupted, or that the restore procedure takes six hours and was never tested. A backup that has never been restored is not a backup — it is a ritual that makes you feel better.
This guide covers backup strategies that actually protect your data, and the restore testing that proves they work.
What You Are Actually Protecting Against
Different threats require different backup strategies:
Accidental deletion or corruption. A developer runs DELETE FROM users without a WHERE clause. A bug corrupts rows in the database. You need to recover from a specific point in time before the mistake.
Infrastructure failure. Your cloud provider has an outage, a disk fails, the database instance terminates. You need to restore to a new instance quickly.
Ransomware or security breach. An attacker encrypts or destroys your data. You need an offline copy that cannot be reached by the attacker.
Disaster recovery. Your entire region goes offline. You need to restore in a different region.
Each of these requires a different element of your backup strategy.
Physical vs Logical Backups
Logical backups (pg_dump) export the data as SQL statements or as pg_dump's custom archive format. They are independent of the PostgreSQL version and architecture, human-readable in plain format, and make it easy to restore specific tables or rows.
Physical backups copy the actual database files. They are faster to take and restore for large databases and, combined with WAL archiving, support point-in-time recovery (PITR) — but they can only be restored on the same major PostgreSQL version.
For most production databases, use both:
- Logical backups for selective restores and migration safety
- Physical backups with WAL archiving for PITR
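The selective-restore advantage of logical backups is worth seeing concretely. A sketch, assuming a custom-format dump named backup.dump and a target database myapp (both placeholder names):

```shell
# Inspect the contents of a custom-format dump without restoring anything
pg_restore --list backup.dump

# Restore just the users table (schema + data) into an existing database
pg_restore --dbname=myapp --table=users backup.dump

# Or extract a single table as plain SQL for inspection
pg_restore --table=users --file=users.sql backup.dump
```

Physical backups offer no equivalent: restoring one table from a file-level copy means restoring the whole cluster somewhere and dumping from it.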
Setting Up Automated Logical Backups
A simple backup script:
#!/bin/bash
# backup.sh
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="backup_${TIMESTAMP}.dump"
S3_BUCKET="s3://your-backup-bucket/postgres"
RETENTION_DAYS=30
# Create backup in pg_dump's custom format. The custom format is already
# compressed (--compress), so do not pipe it through gzip — pg_restore
# cannot read a gzipped archive directly.
pg_dump \
--format=custom \
--compress=9 \
--no-owner \
--no-acl \
--file="/tmp/${BACKUP_FILE}" \
"${DATABASE_URL}"
# Upload to S3
aws s3 cp "/tmp/${BACKUP_FILE}" "${S3_BUCKET}/${BACKUP_FILE}"
# Clean up local file
rm "/tmp/${BACKUP_FILE}"
# Remove old backups. This keeps the newest ${RETENTION_DAYS} files, which
# matches the retention period only if exactly one backup is taken per day
# (and `head -n -N` requires GNU coreutils). An S3 lifecycle policy, shown
# later, is a more robust alternative.
aws s3 ls "${S3_BUCKET}/" \
| awk '{print $4}' \
| sort \
| head -n -${RETENTION_DAYS} \
| xargs -r -I{} aws s3 rm "${S3_BUCKET}/{}"
echo "Backup completed: ${BACKUP_FILE}"
Schedule with cron (daily at 2am):
0 2 * * * /path/to/backup.sh >> /var/log/db-backup.log 2>&1
Or as a Kubernetes CronJob if you are running in containers:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:16
              command: ["/bin/bash", "/scripts/backup.sh"]
              envFrom:
                - secretRef:
                    name: postgres-secrets
          restartPolicy: OnFailure
Point-in-Time Recovery With WAL Archiving
WAL (Write-Ahead Log) archiving lets you restore the database to any point in time by replaying log files. Combined with a base backup, this is the most powerful recovery option.
Configure PostgreSQL to archive WAL files:
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://your-wal-bucket/%f'
archive_timeout = 60 # Force WAL switch every 60 seconds
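One caveat: a bare archive_command like the one above will silently overwrite an existing archive file. The PostgreSQL documentation recommends a command that refuses to overwrite and returns non-zero so the server retries and the problem becomes visible. A sketch of such a wrapper as a shell function — the bucket name and the LIST_CMD/COPY_CMD override hooks are assumptions for illustration:

```shell
# safe_archive WAL_PATH WAL_NAME — refuses to overwrite an already-archived
# segment, per the PostgreSQL continuous-archiving docs. Install the body as
# a script and set: archive_command = '/usr/local/bin/safe-archive.sh %p %f'
safe_archive() {
  local wal_path="$1" wal_name="$2"
  local dest="${WAL_DEST:-s3://your-wal-bucket}"   # destination (assumed bucket)
  # If the segment already exists at the destination, fail loudly instead
  # of overwriting — PostgreSQL will retry and log the failure.
  if ${LIST_CMD:-aws s3 ls} "${dest}/${wal_name}" >/dev/null 2>&1; then
    echo "refusing to overwrite already-archived ${wal_name}" >&2
    return 1
  fi
  ${COPY_CMD:-aws s3 cp} "${wal_path}" "${dest}/${wal_name}"
}
```

The LIST_CMD/COPY_CMD hooks also make the logic testable locally with `test -f` and `cp` standing in for the S3 calls.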
Take a base physical backup periodically:
# Full physical backup with pg_basebackup
pg_basebackup \
--host=${PGHOST} \
--username=${PGUSER} \
--format=tar \
--gzip \
--checkpoint=fast \
--wal-method=stream \
--pgdata=/tmp/base_backup
To restore to a specific point in time:
# Restore the base backup
tar -xzf base_backup.tar.gz -C /var/lib/postgresql/data/
# Configure recovery (PostgreSQL 12+ removed recovery.conf; recovery
# settings go in postgresql.auto.conf, and an empty recovery.signal file
# tells the server to enter recovery on startup)
cat >> /var/lib/postgresql/data/postgresql.auto.conf << EOF
restore_command = 'aws s3 cp s3://your-wal-bucket/%f %p'
recovery_target_time = '2026-03-03 14:00:00'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal
# Start PostgreSQL — it will replay WAL until the target time
pg_ctl start
Managed Database Backup Features
If you are using a managed PostgreSQL service (AWS RDS, Google Cloud SQL, Supabase, Neon), they provide automated backups and PITR out of the box. Understand what is and is not covered:
What managed backups provide:
- Automated daily or continuous backups
- PITR to any second within the retention window
- Cross-region backup replication (usually a paid option)
- One-click restore
What they do not protect against:
- Application-level data corruption (wrong data written correctly)
- Account-level incidents (your cloud account compromised)
- Cross-region disasters if backup replication is not enabled
Supplement managed backups with your own logical backups to an independent destination (different cloud provider, different account).
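One way to get that independence is to push each backup to a second bucket under a separate, dedicated credential profile that the application never holds. A sketch — the profile and bucket names are placeholders:

```shell
# Primary copy, written with the application account's credentials
aws s3 cp backup.dump s3://primary-backup-bucket/backup.dump

# Secondary copy to a bucket in a different account (or a different
# provider's S3-compatible endpoint), using a dedicated AWS CLI profile
aws s3 cp backup.dump s3://independent-backup-bucket/backup.dump \
--profile backup-copier
```

If the application account is compromised, the attacker's credentials cannot touch the secondary bucket.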
Backup Encryption and Security
Backups containing user data must be encrypted at rest. Most S3-compatible storage supports server-side encryption:
# Encrypt during upload with SSE-KMS (the high-level `aws s3 cp`
# command uses --sse / --sse-kms-key-id)
aws s3 cp backup.sql.gz s3://bucket/backup.sql.gz \
--sse aws:kms \
--sse-kms-key-id your-kms-key-id
For additional protection, encrypt before uploading:
# Encrypt with GPG
gpg --symmetric --cipher-algo AES256 backup.sql.gz
aws s3 cp backup.sql.gz.gpg s3://bucket/backup.sql.gz.gpg
Store encryption keys separately from backups. A backup encrypted with a key that lives in the same compromised system is not protected.
For offline protection against ransomware, store at least one backup copy in a write-once storage tier (S3 Object Lock, or a physically separate offline store) that cannot be modified or deleted by the credentials your application uses.
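With S3 Object Lock (the bucket must be created with Object Lock enabled), each uploaded backup can carry a retain-until date that even the uploading credentials cannot shorten in COMPLIANCE mode. A sketch using the low-level s3api upload call; bucket, key, and dates are placeholders:

```shell
# Upload a backup with a 30-day compliance-mode retention lock. The
# object version cannot be deleted or overwritten by anyone — including
# the account root — until the retain-until date passes.
aws s3api put-object \
--bucket locked-backup-bucket \
--key "daily/backup_20260303.dump" \
--body backup.dump \
--object-lock-mode COMPLIANCE \
--object-lock-retain-until-date "2026-04-02T00:00:00Z"
```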
Retention Policy
A sensible retention policy:
- Continuous WAL archiving: 7-30 days (enables PITR within the window)
- Daily logical backups: 30 days
- Weekly backups: 3 months
- Monthly backups: 1 year
- Annual backups: 7 years (regulatory requirement for some industries)
Automate retention cleanup — manually managing this is error-prone. S3 lifecycle policies handle this:
{
  "Rules": [
    {
      "ID": "daily-backup-retention",
      "Filter": { "Prefix": "daily/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    },
    {
      "ID": "weekly-backup-retention",
      "Filter": { "Prefix": "weekly/" },
      "Status": "Enabled",
      "Expiration": { "Days": 90 }
    }
  ]
}
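Saved as lifecycle.json, the rules above can be attached to the bucket with one s3api call (bucket name assumed):

```shell
# Attach the lifecycle rules to the backup bucket.
# Note: this call REPLACES any existing lifecycle configuration,
# so include all rules in the one file.
aws s3api put-bucket-lifecycle-configuration \
--bucket your-backup-bucket \
--lifecycle-configuration file://lifecycle.json
```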
The Most Important Part: Testing Restores
A backup you have never restored is theoretical. Test your restore procedure quarterly at minimum:
# Restore a custom-format dump to a scratch cluster. --create recreates
# the database under its original name, so connect to the postgres
# maintenance database first.
pg_restore \
--dbname=postgres \
--create \
--no-owner \
--verbose \
backup.dump
# Refresh planner statistics, then compare row counts with production
# (n_live_tup is an estimate; run ANALYZE first so it is populated
# promptly after a bulk restore). Replace your_database with the name
# recorded in the dump.
psql -d your_database -c "ANALYZE;"
psql -d your_database -c "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
Automated restore testing:
#!/bin/bash
# test-restore.sh — Run weekly
set -euo pipefail
# Pick the most recent custom-format dump
BACKUP_FILE=$(aws s3 ls s3://backup-bucket/ | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://backup-bucket/${BACKUP_FILE}" /tmp/test-backup.dump
# Create test database
psql -c "DROP DATABASE IF EXISTS restore_test;"
psql -c "CREATE DATABASE restore_test;"
# Restore
pg_restore --dbname=restore_test /tmp/test-backup.dump
# Verify basic integrity (ANALYZE first so n_live_tup estimates are populated)
psql -d restore_test -c "ANALYZE;"
ROW_COUNT=$(psql -t -A -c "SELECT COALESCE(SUM(n_live_tup), 0) FROM pg_stat_user_tables;" restore_test)
echo "Restored row count: ${ROW_COUNT}"
if [ "${ROW_COUNT}" -lt "1000" ]; then
echo "WARNING: Suspicious row count after restore"
# Send alert
fi
# Clean up
psql -c "DROP DATABASE restore_test;"
rm /tmp/test-backup.dump
echo "Restore test completed successfully"
Monitoring Backup Health
Alert when backups fail or are missing:
# Check that a backup file was created in the last 25 hours
RECENT_BACKUP=$(aws s3 ls s3://backup-bucket/ \
| awk '{print $4}' \
| sort \
| tail -1)
BACKUP_AGE_HOURS=$(python3 -c "
import boto3, datetime
s3 = boto3.client('s3')
obj = s3.head_object(Bucket='backup-bucket', Key='${RECENT_BACKUP}')
age = (datetime.datetime.now(datetime.timezone.utc) - obj['LastModified']).total_seconds() / 3600
print(f'{age:.1f}')
")
if (( $(echo "${BACKUP_AGE_HOURS} > 25" | bc -l) )); then
echo "ALERT: Most recent backup is ${BACKUP_AGE_HOURS} hours old"
# Send alert to your monitoring system
fi
Your backup strategy is only as good as the restore you have tested most recently. Build the testing into your operational routine, not as something to do when you remember.
Need help designing a backup and recovery strategy for a production PostgreSQL database, or recovering from a data loss incident? This is exactly the kind of problem I work on with clients. Book a call: calendly.com/jamesrossjr.