Backup Reality Check

Your stated RPO is wrong.

The RPO you quote is the backup interval. The RPO you will actually experience is the interval plus backup duration, transfer time, and the verification gap you skipped. This page calculates your true data-loss window, generates a tested restore script for your stack, and gives you the verification commands that catch silent failures before 3am does.

Your Setup

Backup interval in hours (e.g. 24 = daily, 1 = hourly, 168 = weekly)
Speed to the restore destination (LAN: 1000+, WAN: 10–200, S3/cloud: 50–200)
Results
Stated RPO: what you tell stakeholders
True RPO (worst case): backup interval + backup duration + transfer time + verify gap
Backup duration: time to complete one full backup at measured speed
Estimated RTO: download + restore + WAL replay (if applicable)
Download time: time to transfer the backup to the restore host at your egress speed
Restore time: time to complete the database / file restore
WAL replay: additional time for WAL/binlog catch-up (WAL tools only)

UNTESTED: These numbers are estimates. Until you run a full restore drill and confirm row counts + application queries, your actual RTO is unknown. Untested restores consistently take 3–10x longer than estimated.
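
The arithmetic itself, as a minimal sketch. The speeds are assumed to be in Mbps and every number below is a placeholder, not a benchmark:

#!/usr/bin/env bash
# Placeholder inputs: a nightly dump of a 200 GB database.
INTERVAL_H=24        # backup interval, hours (24 = daily)
DATA_GB=200          # data set size
BACKUP_MBPS=400      # measured speed of the backup itself
EGRESS_MBPS=100      # speed to the restore destination
RESTORE_MBPS=300     # measured restore / import speed
VERIFY_H=6           # gap until anything looks at the backup

awk -v i=$INTERVAL_H -v d=$DATA_GB -v b=$BACKUP_MBPS -v e=$EGRESS_MBPS \
    -v r=$RESTORE_MBPS -v v=$VERIFY_H 'BEGIN {
  backup_h   = d * 8000 / b / 3600;   # GB -> megabits -> hours
  download_h = d * 8000 / e / 3600;
  restore_h  = d * 8000 / r / 3600;
  printf "stated RPO  %5.1f h\n", i;
  printf "true RPO    %5.1f h\n", i + backup_h + download_h + v;
  printf "est. RTO    %5.1f h (download + restore, before WAL replay)\n", download_h + restore_h;
}'

With those placeholders, the stated 24-hour RPO comes out near 36 hours, and the RTO near 6 hours before WAL replay or the untested-restore multiplier.
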
Commands verified May 2026.

Verifier Cookbook

Post-backup verification is not checking that the backup process exited 0. It is confirming that the backup contains data you can actually restore. These are different things.

What “verified” means

A backup is verified when you have:

  1. Completed a restore to a separate host or directory
  2. Confirmed row/file counts match the source (within expected replication lag)
  3. Confirmed key application queries return correct results
  4. Confirmed the restored instance can accept new writes

A backup you have not restored is not verified. The cron jobs below automate this monthly.
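
As a concrete skeleton, here is one such drill. This is a sketch only: the hosts, database name, query, and tolerance are placeholders, it assumes a Postgres restore, and the per-tool checks from the sections below slot into steps 2 and 3.

#!/usr/bin/env bash
set -euo pipefail

# 1. Restore to a separate host (tool-specific; see the cookbook below)
ssh restore-host '/opt/scripts/restore-latest-backup.sh'

# 2. Row counts match the source within expected lag
SRC=$(psql -h prod-db -At -c "SELECT count(*) FROM orders;")
DST=$(ssh restore-host "psql -d app_restore -At -c 'SELECT count(*) FROM orders;'")
GAP=$((SRC - DST))
[ "$GAP" -ge 0 ] && [ "$GAP" -le 1000 ] || { echo "row count gap: $GAP" >&2; exit 1; }

# 3. A key application query returns sane results
ssh restore-host "psql -d app_restore -At -c \
  \"SELECT count(*) FROM orders WHERE created_at > now() - interval '7 days';\"" | grep -qv '^0$'

# 4. The restored instance accepts writes
ssh restore-host "psql -d app_restore -c \
  'CREATE TABLE _drill (ts timestamptz DEFAULT now()); DROP TABLE _drill;'"

echo "drill OK"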


Postgres

pg_dump / pg_restore

# After restore to ${DB}_restore:
psql -d ${DB}_restore -c "SELECT schemaname, tablename, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY n_live_tup DESC LIMIT 20;"

# Confirm no empty tables that should have data:
psql -d ${DB}_restore -c "SELECT tablename FROM pg_stat_user_tables
  WHERE n_live_tup = 0 AND schemaname='public';"

# Confirm sequences are not reset to 1 (collision risk):
psql -d ${DB}_restore -c "SELECT sequence_name, last_value
  FROM information_schema.sequences
  WHERE sequence_schema='public';"

# Run a query that exercises FK constraints:
psql -d ${DB}_restore -c "SELECT COUNT(*) FROM orders JOIN customers ON orders.customer_id = customers.id;"

Silent failure to catch: pg_dump exits 0 even if some objects fail to dump. Always grep stderr for error: before trusting the dump. (pg_dump docs)
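
A sketch of that check wrapped around the dump itself; the output and stderr paths are placeholders:

# Fail the backup job if pg_dump exits non-zero OR wrote anything error-like to stderr.
pg_dump -Fc -d "$DB" -f "/backups/${DB}.dump" 2> /tmp/pg_dump.stderr
rc=$?
if [ $rc -ne 0 ] || grep -qi "error:" /tmp/pg_dump.stderr; then
  echo "pg_dump reported problems (exit $rc):" >&2
  cat /tmp/pg_dump.stderr >&2
  exit 1
fi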

WAL-G / Barman / pg_basebackup

# Confirm recovery completed (not still in recovery mode):
psql -c "SELECT pg_is_in_recovery();"
# Must return: f

# Check recovery lag (how far from source):
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS lag;"

# Verify WAL continuity:
wal-g wal-verify integrity  # WAL-G
barman check <server>        # Barman

# Attempt a write to confirm instance is in read-write mode:
psql -c "CREATE TABLE _restore_verify (ts timestamptz DEFAULT now()); INSERT INTO _restore_verify DEFAULT VALUES; SELECT * FROM _restore_verify; DROP TABLE _restore_verify;"

Silent failure to catch: restore_command returning non-zero on a missing WAL segment is treated as end-of-archive — Postgres stops replay and proceeds per recovery_target_action (default pause/shutdown, NOT promote). The instance can come up with less data than expected without an obvious error on the missing segment. Always run wal-g wal-verify integrity (or barman check) before relying on a restore. (recovery_target_action, wal-g wal-verify)
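
One way to script the lag check into a drill, run immediately after recovery completes. The 15-minute tolerance is an assumption; tune it to your backup interval:

# Fail the drill if the instance is still in recovery or replayed to a point older than tolerated.
psql -At -c "SELECT pg_is_in_recovery();" | grep -qx f || { echo "still in recovery" >&2; exit 1; }

LAG=$(psql -At -c "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int;")
if [ -z "$LAG" ] || [ "$LAG" -gt 900 ]; then
  echo "replay stopped ${LAG:-unknown}s behind the source; check for missing WAL segments" >&2
  exit 1
fi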


MySQL

mysqldump

# Row counts per table (table_rows is an estimate for InnoDB; use SELECT COUNT(*) on critical tables for exact figures):
mysql -D ${DB}_restore -e "SELECT table_name, table_rows
  FROM information_schema.tables
  WHERE table_schema='${DB}_restore'
  ORDER BY table_rows DESC LIMIT 20;"

# Confirm stored procedures restored:
mysql -e "SELECT COUNT(*) FROM information_schema.ROUTINES WHERE ROUTINE_SCHEMA='${DB}_restore';"

# Confirm triggers:
mysql -e "SELECT COUNT(*) FROM information_schema.TRIGGERS WHERE TRIGGER_SCHEMA='${DB}_restore';"

# Test a write:
mysql -D ${DB}_restore -e "CREATE TABLE _verify (id INT AUTO_INCREMENT PRIMARY KEY, ts DATETIME DEFAULT NOW()); INSERT INTO _verify () VALUES (); SELECT * FROM _verify; DROP TABLE _verify;"

Silent failure to catch: mysqldump without --routines silently omits stored procedures and functions. (--triggers IS enabled by default and tied to per-table dumps; the silent omission is --routines and --events.) Also: mysqldump has NO --skip-definer flag (only mysqlpump does), so DEFINER clauses for missing users break specific objects without a top-level error. Workaround: sed -E 's/DEFINER=`[^`]+`@`[^`]+`/DEFINER=CURRENT_USER/g' on the dump. (mysqldump --routines, mysqlpump --skip-definer)
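
A dump invocation that avoids those omissions, as a sketch; the stderr and destination paths are placeholders:

# --routines and --events are off by default; --triggers is explicit for clarity.
mysqldump --single-transaction --routines --events --triggers "$DB" \
  2> /tmp/mysqldump.stderr \
  | sed -E 's/DEFINER=`[^`]+`@`[^`]+`/DEFINER=CURRENT_USER/g' \
  > "/backups/${DB}.sql"
rc=${PIPESTATUS[0]}
if [ "$rc" -ne 0 ] || grep -qi "error" /tmp/mysqldump.stderr; then
  cat /tmp/mysqldump.stderr >&2
  exit 1
fi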

xtrabackup

# After --prepare and --copy-back, before starting MySQL:
ls -la /var/lib/mysql/ | head -20

# After MySQL starts:
mysql -e "SELECT COUNT(*) FROM information_schema.INNODB_TABLES WHERE NAME LIKE '${DB}/%';"

# Confirm xtrabackup metadata:
cat /var/lib/mysql/xtrabackup_info

# Check for InnoDB corruption:
mysqlcheck --all-databases --check --extended 2>&1 | grep -v "OK$"

Silent failure to catch: xtrabackup --prepare on an incremental chain with --apply-log-only omitted on a non-final step makes subsequent incrementals unrecoverable. Sequence: prepare the full and every incremental except the last with --apply-log-only, then the final incremental with a plain --prepare (no --apply-log-only). Separately: if the source used InnoDB tablespace encryption, the keyring file must be backed up alongside the data; without it MySQL refuses to start (InnoDB recovery aborts), it does not return garbage. (Incremental backup prepare, Encrypted InnoDB tablespace backups)
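
The prepare order as a sketch, assuming one full plus two incrementals; the /backups paths are placeholders:

# Every step except the last must use --apply-log-only, or later incrementals become unusable.
xtrabackup --prepare --apply-log-only --target-dir=/backups/full
xtrabackup --prepare --apply-log-only --target-dir=/backups/full --incremental-dir=/backups/inc1
xtrabackup --prepare                  --target-dir=/backups/full --incremental-dir=/backups/inc2
xtrabackup --copy-back --target-dir=/backups/full   # datadir must be empty
chown -R mysql:mysql /var/lib/mysql                 # fix ownership before starting MySQL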


MongoDB

mongodump / mongorestore

# Document counts per collection:
mongosh "$TARGET_URI/${DB}_restore" --eval "
db.getCollectionNames().forEach(function(c) {
  print(c + ': ' + db[c].countDocuments());
});"

# Validate collections (checks BSON integrity):
mongosh "$TARGET_URI/${DB}_restore" --eval "
db.getCollectionNames().forEach(function(c) {
  var r = db.runCommand({validate: c});
  print(c + ': valid=' + r.valid + ' errors=' + r.errors.length);
});"

# Confirm indexes are built:
mongosh "$TARGET_URI/${DB}_restore" --eval "
db.getCollectionNames().forEach(function(c) {
  print(c + ': ' + db[c].getIndexes().length + ' indexes');
});"

Silent failure to catch: mongorestore without --oplogReplay (and a source dump taken with --oplog) is not point-in-time consistent across collections. Documents in collection A may reference documents in collection B written after the dump started. Capped collections AND views ARE preserved: their definitions live in *.metadata.json and are recreated faithfully (the older claim that capped-ness is lost is incorrect). (mongodump --oplog, mongorestore)
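
The consistent path as a sketch; the URIs and dump directory are placeholders, and --oplog requires a replica-set source:

# Dump with an oplog slice so the restore can be replayed to a single point in time.
mongodump --uri "$SOURCE_URI" --oplog --gzip --out /backups/mongo/latest

# Replay that slice on restore; without --oplogReplay the dump is only per-collection consistent.
mongorestore --uri "$TARGET_URI" --oplogReplay --gzip --dir /backups/mongo/latest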


Redis

RDB snapshot

# Key count:
redis-cli DBSIZE

# Keyspace breakdown by database:
redis-cli INFO keyspace

# Confirm persistence is healthy:
redis-cli INFO persistence | grep -E 'rdb_last_bgsave_status|rdb_last_save_time|aof_enabled'

# Spot-check specific keys:
redis-cli --scan --pattern 'session:*' | head -10 | xargs -I{} redis-cli TTL {}

# Confirm server is writable:
redis-cli SET _restore_verify ok EX 60
redis-cli GET _restore_verify
redis-cli DEL _restore_verify

Silent failure to catch: if AOF is enabled on the target instance, Redis loads the AOF at startup, not the RDB you just copied. Disable AOF in the config (appendonly no) before starting Redis after an RDB restore, or remove the appendonly.aof file (the appendonlydir/ contents on Redis 7+), then re-enable it with CONFIG SET appendonly yes once the data is loaded. Also confirm CONFIG GET dir and CONFIG GET dbfilename: if the RDB lands anywhere other than the configured location, Redis ignores it and starts empty (DBSIZE=0, no error). (Redis persistence)
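
The restore sequence as a sketch; the paths assume a Debian-style install (/etc/redis/redis.conf, /var/lib/redis/dump.rdb) and a backup location that is a placeholder:

systemctl stop redis
# The copy must land exactly where CONFIG dir + dbfilename point, or Redis starts empty.
cp /backups/redis/dump.rdb /var/lib/redis/dump.rdb
chown redis:redis /var/lib/redis/dump.rdb
# Disable AOF in the config so Redis loads this RDB instead of an old AOF.
sed -i 's/^appendonly yes/appendonly no/' /etc/redis/redis.conf
systemctl start redis
redis-cli DBSIZE                       # should match the source key count, not 0
redis-cli CONFIG SET appendonly yes    # re-enable AOF; triggers a rewrite from the loaded data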


File trees

Restic

# Repository integrity (reads random 10% of data chunks):
restic check --read-data-subset 10%

# Snapshot detail:
restic stats latest

# File count in snapshot:
restic ls latest | wc -l

# Spot-check against source:
restic ls latest | shuf | head -100 | while read f; do
  [ -e "$SOURCE$f" ] && echo "OK: $f" || echo "MISSING: $f"
done

# Full restore dry-run:
restic restore latest --target /tmp/restic-verify --dry-run

Silent failure to catch: restic check without --read-data only checks repository structure, not whether chunk contents are correct — bit rot passes structural check. Use --read-data-subset (e.g. 10%) for regular drills, full --read-data annually. On append-only S3 backends, restic forget --prune FAILS with permission denied (it does NOT skip silently) — use restic forget without --prune for retention, then relax the bucket policy briefly when actual prune is needed. (restic check, restic append-only)
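
One way to schedule the two depths from the same monthly job; the January choice is an assumption, not a restic convention:

# Full data read once a year, 10% sample the other months.
if [ "$(date +%m)" = "01" ]; then
  restic check --read-data
else
  restic check --read-data-subset 10%
fi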

Borg

# Repository check with data verification:
borg check --verify-data $BORG_REPO

# Resolve the newest archive (Borg has no built-in "latest" alias):
ARCHIVE=$(borg list --last 1 --format '{archive}' $BORG_REPO)

# Archive info:
borg info $BORG_REPO::$ARCHIVE

# File count:
borg list $BORG_REPO::$ARCHIVE | wc -l

# Extract to tmpfs (fast, memory-only verify):
mkdir -p /tmp/borg-verify
mount -t tmpfs tmpfs /tmp/borg-verify -o size=2g
cd /tmp/borg-verify
borg extract --list $BORG_REPO::$ARCHIVE
ls -la
cd / && umount /tmp/borg-verify

Silent failure to catch: borg prune does not free disk space until borg compact is run (separate step since Borg 1.2). Prune marks segments for deletion; compact reclaims them. Running only prune + checking du shows no change, leading operators to assume prune failed. Always run borg compact after borg prune. Note: Borg 1.x and 2.x repository formats are incompatible — use borg transfer to migrate. (borg compact, Borg 2.0 changes)
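
Retention as it should appear in the job, a sketch with placeholder keep rules:

# prune marks segments for deletion; compact is the step that actually frees disk (Borg >= 1.2).
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 "$BORG_REPO"
borg compact "$BORG_REPO"
df -h "$(dirname "$BORG_REPO")"   # confirm space was actually reclaimed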

rsync

# Count restored files:
find $TARGET -type f | wc -l

# Compare with source (dry-run shows differences):
rsync --dry-run --checksum --itemize-changes -a $SOURCE/ $TARGET/ 2>&1 | head -50

# Check for rsync partial transfer warning in restore log:
grep -iE "error|warning|skipping|failed" /tmp/rsync_restore.log | head -20

# Size comparison:
du -sh $SOURCE $TARGET

Silent failure to catch: rsync exits 23 (partial transfer due to error) or 24 (vanished source files) when files could not be transferred. Note: --ignore-errors only governs whether --delete proceeds past previous I/O errors; it does NOT silence permission-skip exits. Always capture and check the exit code: rsync ... ; echo "rsync exit $?". (rsync exit values, --delete + --ignore-errors)
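
A sketch of capturing the exit code while keeping the log, treating vanished-file exits (24) as a warning; the log path matches the grep above:

rsync -a --itemize-changes "$SOURCE/" "$TARGET/" 2>&1 | tee /tmp/rsync_restore.log
rc=${PIPESTATUS[0]}
case $rc in
  0)  echo "rsync clean" ;;
  24) echo "rsync warning: source files vanished during transfer" >&2 ;;
  *)  echo "rsync failed with exit $rc (23 = partial transfer)" >&2; exit "$rc" ;;
esac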

Snapshots

# Confirm volume is attached and mounted (Nitro instances expose EBS as NVMe):
lsblk | grep -E 'nvme1n1|xvdf'
df -h /mnt/restore

# Database-specific check after mounting:
# Postgres:
pg_ctl start -D /mnt/restore/var/lib/postgresql/data -l /tmp/snap_restore.log
sleep 5
psql -c "SELECT COUNT(*) FROM pg_stat_user_tables;" 2>&1 | head -5
pg_ctl stop -D /mnt/restore/var/lib/postgresql/data

# Files:
find /mnt/restore -type f | wc -l
du -sh /mnt/restore

Silent failure to catch: EBS volumes restored from snapshot (lazy hydration) show as available, but first-read I/O is extremely slow until blocks are hydrated from S3. Under load, application response times degrade by 10-100x for hours. Pre-warm with the AWS canonical command (per Initialize EBS volumes); note the device on Nitro instances is /dev/nvme1n1, not /dev/xvdf:

fio --filename=/dev/nvme1n1 --rw=read --bs=1M --iodepth=32 --ioengine=libaio --direct=1 --name=volume-initialize
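
If fio is not installed on the restore host, a plain sequential read hydrates the volume the same way; a sketch assuming the same Nitro device name:

# Read every block once so later random I/O hits already-hydrated storage.
sudo dd if=/dev/nvme1n1 of=/dev/null bs=1M status=progress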


Monthly drill cron pattern

All drills should heartbeat to a monitoring service on success. If the heartbeat does not arrive, the monitoring service alerts. This is the “cron-verify” pattern from the HN thread — the cron runs, but you need confirmation the cron did useful work.

# Pattern: run drill, pipe output to heartbeat
0 3 1 * * /opt/scripts/restore-drill-postgres.sh \
  2>&1 | curl -fsS -m 10 https://hc-ping.com/UUID -d @- > /dev/null

# healthchecks.io: free tier, 20 checks, 5-min check interval
# Cronitor: free tier, 5 checks
# Better Uptime: free tier with cron monitoring

The heartbeat URL receives the script output as the POST body. When the drill runs, you see the output. When it does not run, you get an alert. When the drill fails (exit non-zero), the heartbeat is not sent — also an alert.
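
A wrapper that also signals explicit failure, sketched against a healthchecks.io-style endpoint; the UUID is the same placeholder as in the crontab above, and the /fail suffix flags the check as failed immediately instead of waiting for the missed-ping timeout:

#!/usr/bin/env bash
OUT=$(/opt/scripts/restore-drill-postgres.sh 2>&1)
rc=$?
URL="https://hc-ping.com/UUID"
[ $rc -ne 0 ] && URL="$URL/fail"
# Post the drill output either way so the dashboard shows what actually happened.
curl -fsS -m 10 --retry 3 --data-raw "$OUT" "$URL" > /dev/null
exit $rc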

What we do for every customer

vmfarms has run managed Postgres for paying customers since 2009. Every customer database includes WAL-G with monthly tested restore drills, checksum verification, and WAL streaming, all included in the flat monthly price. We started doing this because we kept seeing the same pattern: teams that thought they had backups discovered, on the worst possible night, that they had backup files nobody had ever restored. The restore drill is not optional. It is the only way to know whether your backup is real.