Backup Reality Check

Your stated RPO is wrong.

The RPO you quote is the backup interval. The RPO you will actually experience is the interval plus backup duration, transfer time, and the verification gap you skipped. This page calculates your true data-loss window, generates a tested restore script for your stack, and gives you the verification commands that catch silent failures before 3am does.

Your Setup

Backup interval in hours (e.g. 24 = daily, 1 = hourly, 168 = weekly)
Speed to the restore destination (LAN: 1000+, WAN: 10–200, S3/cloud: 50–200)
Results
Stated RPO: what you tell stakeholders
True RPO (worst case): backup interval + backup duration + transfer time + verify gap
Backup duration: time to complete one full backup at measured speed
Estimated RTO: download + restore + WAL replay (if applicable)
Download time: time to transfer the backup to the restore host at your egress speed
Restore time: time to complete the database / file restore
WAL replay: additional time for WAL/binlog catch-up (WAL tools only)

UNTESTED: These numbers are estimates. Until you run a full restore drill and confirm row counts + application queries, your actual RTO is unknown. Untested restores consistently take 3–10x longer than estimated.
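
The arithmetic itself, as a minimal sketch. The speeds are assumed to be in Mbps and every number below is a placeholder, not a benchmark:

#!/usr/bin/env bash
# Placeholder inputs: a nightly dump of a 200 GB database.
INTERVAL_H=24        # backup interval, hours (24 = daily)
DATA_GB=200          # data set size
BACKUP_MBPS=400      # measured speed of the backup itself
EGRESS_MBPS=100      # speed to the restore destination
RESTORE_MBPS=300     # measured restore / import speed
VERIFY_H=6           # gap until anything looks at the backup

awk -v i=$INTERVAL_H -v d=$DATA_GB -v b=$BACKUP_MBPS -v e=$EGRESS_MBPS \
    -v r=$RESTORE_MBPS -v v=$VERIFY_H 'BEGIN {
  backup_h   = d * 8000 / b / 3600;   # GB -> megabits -> hours
  download_h = d * 8000 / e / 3600;
  restore_h  = d * 8000 / r / 3600;
  printf "stated RPO  %5.1f h\n", i;
  printf "true RPO    %5.1f h\n", i + backup_h + download_h + v;
  printf "est. RTO    %5.1f h (download + restore, before WAL replay)\n", download_h + restore_h;
}'

With those placeholders, the stated 24-hour RPO comes out near 36 hours, and the RTO near 6 hours before WAL replay or the untested-restore multiplier.
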
Commands verified May 2026.

Verifier Cookbook

Post-backup verification is not checking that the backup process exited 0. It is confirming that the backup contains data you can actually restore. These are different things.

What “verified” means

A backup is verified when you have:

  1. Completed a restore to a separate host or directory
  2. Confirmed row/file counts match the source (within expected replication lag)
  3. Confirmed key application queries return correct results
  4. Confirmed the restored instance can accept new writes

A backup you have not restored is not verified. The cron jobs below automate this monthly.
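
As a concrete skeleton, here is one such drill. This is a sketch only: the hosts, database name, query, and tolerance are placeholders, it assumes a Postgres restore, and the per-tool checks from the sections below slot into steps 2 and 3.

#!/usr/bin/env bash
set -euo pipefail

# 1. Restore to a separate host (tool-specific; see the cookbook below)
ssh restore-host '/opt/scripts/restore-latest-backup.sh'

# 2. Row counts match the source within expected lag
SRC=$(psql -h prod-db -At -c "SELECT count(*) FROM orders;")
DST=$(ssh restore-host "psql -d app_restore -At -c 'SELECT count(*) FROM orders;'")
GAP=$((SRC - DST))
[ "$GAP" -ge 0 ] && [ "$GAP" -le 1000 ] || { echo "row count gap: $GAP" >&2; exit 1; }

# 3. A key application query returns sane results
ssh restore-host "psql -d app_restore -At -c \
  \"SELECT count(*) FROM orders WHERE created_at > now() - interval '7 days';\"" | grep -qv '^0$'

# 4. The restored instance accepts writes
ssh restore-host "psql -d app_restore -c \
  'CREATE TABLE _drill (ts timestamptz DEFAULT now()); DROP TABLE _drill;'"

echo "drill OK"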


Postgres

pg_dump / pg_restore

# After restore to ${DB}_restore:
psql -d ${DB}_restore -c "SELECT schemaname, tablename, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY n_live_tup DESC LIMIT 20;"

# Confirm no empty tables that should have data:
psql -d ${DB}_restore -c "SELECT tablename FROM pg_stat_user_tables
  WHERE n_live_tup = 0 AND schemaname='public';"

# Confirm sequences are not reset to 1 (collision risk):
psql -d ${DB}_restore -c "SELECT sequence_name, last_value
  FROM information_schema.sequences
  WHERE sequence_schema='public';"

# Run a query that exercises FK constraints:
psql -d ${DB}_restore -c "SELECT COUNT(*) FROM orders JOIN customers ON orders.customer_id = customers.id;"

Silent failure to catch: pg_dump exits 0 even if some objects fail to dump. Always grep stderr for error: before trusting the dump. (pg_dump docs)
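
A sketch of that check wrapped around the dump itself; the output and stderr paths are placeholders:

# Fail the backup job if pg_dump exits non-zero OR wrote anything error-like to stderr.
pg_dump -Fc -d "$DB" -f "/backups/${DB}.dump" 2> /tmp/pg_dump.stderr
rc=$?
if [ $rc -ne 0 ] || grep -qi "error:" /tmp/pg_dump.stderr; then
  echo "pg_dump reported problems (exit $rc):" >&2
  cat /tmp/pg_dump.stderr >&2
  exit 1
fi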

WAL-G / Barman / pg_basebackup

# Confirm recovery completed (not still in recovery mode):
psql -c "SELECT pg_is_in_recovery();"
# Must return: f

# Check recovery lag (how far from source):
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS lag;"

# Verify WAL continuity:
wal-g wal-verify integrity  # WAL-G
barman check <server>        # Barman

# Attempt a write to confirm instance is in read-write mode:
psql -c "CREATE TABLE _restore_verify (ts timestamptz DEFAULT now()); INSERT INTO _restore_verify DEFAULT VALUES; SELECT * FROM _restore_verify; DROP TABLE _restore_verify;"

Silent failure to catch: restore_command returning non-zero on a missing WAL segment is treated as end-of-archive — Postgres stops replay and proceeds per recovery_target_action (default pause/shutdown, NOT promote). The instance can come up with less data than expected without an obvious error on the missing segment. Always run wal-g wal-verify integrity (or barman check) before relying on a restore. (recovery_target_action, wal-g wal-verify)
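
One way to script the lag check into a drill, run immediately after recovery completes. The 15-minute tolerance is an assumption; tune it to your backup interval:

# Fail the drill if the instance is still in recovery or replayed to a point older than tolerated.
psql -At -c "SELECT pg_is_in_recovery();" | grep -qx f || { echo "still in recovery" >&2; exit 1; }

LAG=$(psql -At -c "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int;")
if [ -z "$LAG" ] || [ "$LAG" -gt 900 ]; then
  echo "replay stopped ${LAG:-unknown}s behind the source; check for missing WAL segments" >&2
  exit 1
fi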


MySQL

mysqldump

# Row counts per table (table_rows is an estimate for InnoDB; use SELECT COUNT(*) on critical tables for exact figures):
mysql -D ${DB}_restore -e "SELECT table_name, table_rows
  FROM information_schema.tables
  WHERE table_schema='${DB}_restore'
  ORDER BY table_rows DESC LIMIT 20;"

# Confirm stored procedures restored:
mysql -e "SELECT COUNT(*) FROM information_schema.ROUTINES WHERE ROUTINE_SCHEMA='${DB}_restore';"

# Confirm triggers:
mysql -e "SELECT COUNT(*) FROM information_schema.TRIGGERS WHERE TRIGGER_SCHEMA='${DB}_restore';"

# Test a write:
mysql -D ${DB}_restore -e "CREATE TABLE _verify (id INT AUTO_INCREMENT PRIMARY KEY, ts DATETIME DEFAULT NOW()); INSERT INTO _verify () VALUES (); SELECT * FROM _verify; DROP TABLE _verify;"

Silent failure to catch: mysqldump without --routines silently omits stored procedures and functions. (--triggers IS enabled by default and tied to per-table dumps; the silent omission is --routines and --events.) Also: mysqldump has NO --skip-definer flag (only mysqlpump does), so DEFINER clauses for missing users break specific objects without a top-level error. Workaround: sed -E 's/DEFINER=`[^`]+`@`[^`]+`/DEFINER=CURRENT_USER/g' on the dump. (mysqldump --routines, mysqlpump --skip-definer)
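
A dump invocation that avoids those omissions, as a sketch; the stderr and destination paths are placeholders:

# --routines and --events are off by default; --triggers is explicit for clarity.
mysqldump --single-transaction --routines --events --triggers "$DB" \
  2> /tmp/mysqldump.stderr \
  | sed -E 's/DEFINER=`[^`]+`@`[^`]+`/DEFINER=CURRENT_USER/g' \
  > "/backups/${DB}.sql"
rc=${PIPESTATUS[0]}
if [ "$rc" -ne 0 ] || grep -qi "error" /tmp/mysqldump.stderr; then
  cat /tmp/mysqldump.stderr >&2
  exit 1
fi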

xtrabackup

# After --prepare and --copy-back, before starting MySQL:
ls -la /var/lib/mysql/ | head -20

# After MySQL starts:
mysql -e "SELECT COUNT(*) FROM information_schema.INNODB_TABLES WHERE NAME LIKE '${DB}/%';"

# Confirm xtrabackup metadata:
cat /var/lib/mysql/xtrabackup_info

# Check for InnoDB corruption:
mysqlcheck --all-databases --check --extended 2>&1 | grep -v "OK$"

Silent failure to catch: xtrabackup --prepare on an incremental chain with --apply-log-only omitted on a non-final step makes subsequent incrementals unrecoverable. Sequence: prepare the full and every incremental except the last with --apply-log-only, then the final incremental with a plain --prepare (no --apply-log-only). Separately: if the source used InnoDB tablespace encryption, the keyring file must be backed up alongside the data; without it MySQL refuses to start (InnoDB recovery aborts), it does not return garbage. (Incremental backup prepare, Encrypted InnoDB tablespace backups)
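
The prepare order as a sketch, assuming one full plus two incrementals; the /backups paths are placeholders:

# Every step except the last must use --apply-log-only, or later incrementals become unusable.
xtrabackup --prepare --apply-log-only --target-dir=/backups/full
xtrabackup --prepare --apply-log-only --target-dir=/backups/full --incremental-dir=/backups/inc1
xtrabackup --prepare                  --target-dir=/backups/full --incremental-dir=/backups/inc2
xtrabackup --copy-back --target-dir=/backups/full   # datadir must be empty
chown -R mysql:mysql /var/lib/mysql                 # fix ownership before starting MySQL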


MongoDB

mongodump / mongorestore

# Document counts per collection:
mongosh "$TARGET_URI/${DB}_restore" --eval "
db.getCollectionNames().forEach(function(c) {
  print(c + ': ' + db[c].countDocuments());
});"

# Validate collections (checks BSON integrity):
mongosh "$TARGET_URI/${DB}_restore" --eval "
db.getCollectionNames().forEach(function(c) {
  var r = db.runCommand({validate: c});
  print(c + ': valid=' + r.valid + ' errors=' + r.errors.length);
});"

# Confirm indexes are built:
mongosh "$TARGET_URI/${DB}_restore" --eval "
db.getCollectionNames().forEach(function(c) {
  print(c + ': ' + db[c].getIndexes().length + ' indexes');
});"

Silent failure to catch: mongorestore without --oplogReplay (and a source dump taken with --oplog) is not point-in-time consistent across collections. Documents in collection A may reference documents in collection B written after the dump started. Capped collections AND views ARE preserved: their definitions live in *.metadata.json and are recreated faithfully (the older claim that capped-ness is lost is incorrect). (mongodump --oplog, mongorestore)
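
The consistent path as a sketch; the URIs and dump directory are placeholders, and --oplog requires a replica-set source:

# Dump with an oplog slice so the restore can be replayed to a single point in time.
mongodump --uri "$SOURCE_URI" --oplog --gzip --out /backups/mongo/latest

# Replay that slice on restore; without --oplogReplay the dump is only per-collection consistent.
mongorestore --uri "$TARGET_URI" --oplogReplay --gzip --dir /backups/mongo/latest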


Redis

RDB snapshot

# Key count:
redis-cli DBSIZE

# Keyspace breakdown by database:
redis-cli INFO keyspace

# Confirm persistence is healthy:
redis-cli INFO persistence | grep -E 'rdb_last_bgsave_status|rdb_last_save_time|aof_enabled'

# Spot-check specific keys:
redis-cli --scan --pattern 'session:*' | head -10 | xargs -I{} redis-cli TTL {}

# Confirm server is writable:
redis-cli SET _restore_verify ok EX 60
redis-cli GET _restore_verify
redis-cli DEL _restore_verify

Silent failure to catch: if AOF is enabled on the target instance, Redis loads the AOF at startup, not the RDB you just copied. Disable AOF in the config (appendonly no) before starting Redis after an RDB restore, or remove the appendonly.aof file (the appendonlydir/ contents on Redis 7+), then re-enable it with CONFIG SET appendonly yes once the data is loaded. Also confirm CONFIG GET dir and CONFIG GET dbfilename: if the RDB lands anywhere other than the configured location, Redis ignores it and starts empty (DBSIZE=0, no error). (Redis persistence)
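
The restore sequence as a sketch; the paths assume a Debian-style install (/etc/redis/redis.conf, /var/lib/redis/dump.rdb) and a backup location that is a placeholder:

systemctl stop redis
# The copy must land exactly where CONFIG dir + dbfilename point, or Redis starts empty.
cp /backups/redis/dump.rdb /var/lib/redis/dump.rdb
chown redis:redis /var/lib/redis/dump.rdb
# Disable AOF in the config so Redis loads this RDB instead of an old AOF.
sed -i 's/^appendonly yes/appendonly no/' /etc/redis/redis.conf
systemctl start redis
redis-cli DBSIZE                       # should match the source key count, not 0
redis-cli CONFIG SET appendonly yes    # re-enable AOF; triggers a rewrite from the loaded data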


File trees

Restic

# Repository integrity (reads random 10% of data chunks):
restic check --read-data-subset 10%

# Snapshot detail:
restic stats latest

# File count in snapshot:
restic ls latest | wc -l

# Spot-check against source:
restic ls latest | shuf | head -100 | while read f; do
  [ -e "$SOURCE$f" ] && echo "OK: $f" || echo "MISSING: $f"
done

# Full restore dry-run:
restic restore latest --target /tmp/restic-verify --dry-run

Silent failure to catch: restic check without --read-data only checks repository structure, not whether chunk contents are correct — bit rot passes structural check. Use --read-data-subset (e.g. 10%) for regular drills, full --read-data annually. On append-only S3 backends, restic forget --prune FAILS with permission denied (it does NOT skip silently) — use restic forget without --prune for retention, then relax the bucket policy briefly when actual prune is needed. (restic check, restic append-only)
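
One way to schedule the two depths from the same monthly job; the January choice is an assumption, not a restic convention:

# Full data read once a year, 10% sample the other months.
if [ "$(date +%m)" = "01" ]; then
  restic check --read-data
else
  restic check --read-data-subset 10%
fi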

Borg

# Repository check with data verification:
borg check --verify-data $BORG_REPO

# Resolve the newest archive (Borg has no built-in "latest" alias):
ARCHIVE=$(borg list --last 1 --format '{archive}' $BORG_REPO)

# Archive info:
borg info $BORG_REPO::$ARCHIVE

# File count:
borg list $BORG_REPO::$ARCHIVE | wc -l

# Extract to tmpfs (fast, memory-only verify):
mkdir -p /tmp/borg-verify
mount -t tmpfs tmpfs /tmp/borg-verify -o size=2g
cd /tmp/borg-verify
borg extract --list $BORG_REPO::$ARCHIVE
ls -la
cd / && umount /tmp/borg-verify

Silent failure to catch: borg prune does not free disk space until borg compact is run (separate step since Borg 1.2). Prune marks segments for deletion; compact reclaims them. Running only prune + checking du shows no change, leading operators to assume prune failed. Always run borg compact after borg prune. Note: Borg 1.x and 2.x repository formats are incompatible — use borg transfer to migrate. (borg compact, Borg 2.0 changes)
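
Retention as it should appear in the job, a sketch with placeholder keep rules:

# prune marks segments for deletion; compact is the step that actually frees disk (Borg >= 1.2).
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 "$BORG_REPO"
borg compact "$BORG_REPO"
df -h "$(dirname "$BORG_REPO")"   # confirm space was actually reclaimed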

rsync

# Count restored files:
find $TARGET -type f | wc -l

# Compare with source (dry-run shows differences):
rsync --dry-run --checksum --itemize-changes -a $SOURCE/ $TARGET/ 2>&1 | head -50

# Check for rsync partial transfer warning in restore log:
grep -iE "error|warning|skipping|failed" /tmp/rsync_restore.log | head -20

# Size comparison:
du -sh $SOURCE $TARGET

Silent failure to catch: rsync exits 23 (partial transfer due to error) or 24 (vanished source files) when files could not be transferred. Note: --ignore-errors only governs whether --delete proceeds past previous I/O errors; it does NOT silence permission-skip exits. Always capture and check the exit code: rsync ... ; echo "rsync exit $?". (rsync exit values, --delete + --ignore-errors)
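
A sketch of capturing the exit code while keeping the log, treating vanished-file exits (24) as a warning; the log path matches the grep above:

rsync -a --itemize-changes "$SOURCE/" "$TARGET/" 2>&1 | tee /tmp/rsync_restore.log
rc=${PIPESTATUS[0]}
case $rc in
  0)  echo "rsync clean" ;;
  24) echo "rsync warning: source files vanished during transfer" >&2 ;;
  *)  echo "rsync failed with exit $rc (23 = partial transfer)" >&2; exit "$rc" ;;
esac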

Snapshots

# Confirm volume is attached and mounted (Nitro instances expose EBS as NVMe):
lsblk | grep -E 'nvme1n1|xvdf'
df -h /mnt/restore

# Database-specific check after mounting:
# Postgres:
pg_ctl start -D /mnt/restore/var/lib/postgresql/data -l /tmp/snap_restore.log
sleep 5
psql -c "SELECT COUNT(*) FROM pg_stat_user_tables;" 2>&1 | head -5
pg_ctl stop -D /mnt/restore/var/lib/postgresql/data

# Files:
find /mnt/restore -type f | wc -l
du -sh /mnt/restore

Silent failure to catch: EBS volumes restored from snapshot (lazy hydration) show as available, but first-read I/O is extremely slow until blocks are hydrated from S3. Under load, application response times degrade by 10-100x for hours. Pre-warm with the AWS canonical command (per Initialize EBS volumes); note the device on Nitro instances is /dev/nvme1n1, not /dev/xvdf:

fio --filename=/dev/nvme1n1 --rw=read --bs=1M --iodepth=32 --ioengine=libaio --direct=1 --name=volume-initialize
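
If fio is not installed on the restore host, a plain sequential read hydrates the volume the same way; a sketch assuming the same Nitro device name:

# Read every block once so later random I/O hits already-hydrated storage.
sudo dd if=/dev/nvme1n1 of=/dev/null bs=1M status=progress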


Monthly drill cron pattern

All drills should heartbeat to a monitoring service on success. If the heartbeat does not arrive, the monitoring service alerts. This is the “cron-verify” pattern from the HN thread — the cron runs, but you need confirmation the cron did useful work.

# Pattern: run drill, pipe output to heartbeat
0 3 1 * * /opt/scripts/restore-drill-postgres.sh \
  2>&1 | curl -fsS -m 10 https://hc-ping.com/UUID -d @- > /dev/null

# healthchecks.io: free tier, 20 checks, 5-min check interval
# Cronitor: free tier, 5 checks
# Better Uptime: free tier with cron monitoring

The heartbeat URL receives the script output as the POST body. When the drill runs, you see the output. When it does not run, you get an alert. When the drill fails (exit non-zero), the heartbeat is not sent — also an alert.
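
A wrapper that also signals explicit failure, sketched against a healthchecks.io-style endpoint; the UUID is the same placeholder as in the crontab above, and the /fail suffix flags the check as failed immediately instead of waiting for the missed-ping timeout:

#!/usr/bin/env bash
OUT=$(/opt/scripts/restore-drill-postgres.sh 2>&1)
rc=$?
URL="https://hc-ping.com/UUID"
[ $rc -ne 0 ] && URL="$URL/fail"
# Post the drill output either way so the dashboard shows what actually happened.
curl -fsS -m 10 --retry 3 --data-raw "$OUT" "$URL" > /dev/null
exit $rc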

What we do for every customer

vmfarms has run managed Postgres for paying customers since 2009. Every customer database includes WAL-G with monthly tested restore drills, checksum verification, and WAL streaming, all included in the flat monthly price. We started doing this because we kept seeing the same pattern: teams that thought they had backups discovered, on the worst possible night, that they had backup files nobody had ever restored. The restore drill is not optional. It is the only way to know whether your backup is real.