Recovering a Database Everyone Thought Was Lost
A production database crashed hard. No backups. Corrupted files. Thousands of users locked out. Here's how we brought it back with zero data loss.
The Worst Morning Call
"The entire system is down. Nothing works."
That's never the message you want to receive. A SaaS platform serving thousands of daily users was completely offline. The database wouldn't start — it would try, crash, try again, crash again. An endless loop of failure.
This is the story of how we recovered a database that seemed unrecoverable, and the hard lessons that came with it.
Understanding the Damage
The database logs revealed the problem immediately:
PANIC: invalid lp
LOG: startup process (PID xxx) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
This is PostgreSQL aborting during crash recovery. On startup, the startup process replays the Write-Ahead Log (WAL) — the mechanism by which PostgreSQL ensures consistency, logging every change before it is applied. When it hits corruption mid-replay, the database can't determine what state the data should be in, so it refuses to start.
The situation:
- Database in continuous crash loop
- All APIs returning 500 errors
- Thousands of users affected
- Mobile apps completely non-functional
The infrastructure gap:
- No automated backup system
- No streaming replica
- WAL archiving not enabled
- Last manual backup was weeks old
First Rule: Don't Make It Worse
Before touching the production server, we created a complete backup of the corrupted volume:
# Create a backup of the Docker volume
docker run --rm \
-v db-data:/data \
-v /backup:/backup \
alpine tar czf /backup/db-backup-$(date +%Y%m%d).tar.gz -C /data .
# Create a second safety copy
cp /backup/db-backup-*.tar.gz /backup/db-backup-ORIGINAL.tar.gz
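Before experimenting on either copy, it is worth confirming the two archives are byte-identical. A minimal sketch (the paths in the usage comment are the ones created above):

```shell
# Compare checksums of two backup copies; refuse to proceed on mismatch.
verify_copies() {
    local ha hb
    ha=$(sha256sum "$1" | cut -d' ' -f1)
    hb=$(sha256sum "$2" | cut -d' ' -f1)
    if [ "$ha" = "$hb" ]; then
        echo "copies match"
    else
        echo "MISMATCH: do not proceed" >&2
        return 1
    fi
}

# Usage:
# verify_copies /backup/db-backup-20240101.tar.gz /backup/db-backup-ORIGINAL.tar.gz
```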
We then set up an isolated testing environment:
# Extract the safety copy into a scratch data directory
mkdir -p /backup/test-data
tar xzf /backup/db-backup-ORIGINAL.tar.gz -C /backup/test-data
# Create test container with backup data
docker run -d \
--name db-recovery-test \
-v /backup/test-data:/var/lib/postgresql/data \
postgres:15
Every fix we tried, we tried in the test environment first.
The Recovery Process
Step 1: Assess the Damage
First, we needed to understand what was actually corrupted:
# Check PostgreSQL data directory integrity
pg_controldata /var/lib/postgresql/data
# Look for corrupted multixact files
ls -la /var/lib/postgresql/data/pg_multixact/members/
ls -la /var/lib/postgresql/data/pg_multixact/offsets/
The pg_controldata output showed mismatched transaction IDs and corrupted multixact entries — files that track row-level locking across multiple transactions.
Step 2: Reset the Write-Ahead Log
PostgreSQL provides pg_resetwal as a last-resort recovery tool. It essentially tells the database "forget the transaction log and start fresh."
# Stop any running instances
docker service scale db-service=0
# Reset WAL with corrected parameters
# These values must be calculated from pg_controldata output
pg_resetwal -f \
-m [next-multixact-id],[oldest-multixact-id] \
-O [next-oid] \
/var/lib/postgresql/data
Warning: Getting these parameters wrong can cause additional corruption. We calculated them from:
- The pg_controldata output
- Analysis of existing multixact files
- Comparison with a known-good staging database
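Pulling those fields out of the pg_controldata report can be scripted rather than eyeballed. A sketch (the field labels follow pg_controldata's standard report format; verify them against your PostgreSQL version):

```shell
# Extract a named field from pg_controldata output on stdin.
ctl_field() {
    awk -F': *' -v key="$1" '$1 ~ key { gsub(/ +$/, "", $2); print $2; exit }'
}

# Usage:
# pg_controldata /var/lib/postgresql/data | ctl_field "NextMultiXactId"
```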
For the corrupted multixact files, we had to extend them:
# Extend corrupted multixact member file
dd if=/dev/zero bs=4096 count=10 >> pg_multixact/members/0000
# Extend corrupted multixact offset file
dd if=/dev/zero bs=4096 count=5 >> pg_multixact/offsets/0000
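The dd commands above append fixed amounts we derived from the specific files at hand. A more general sketch pads a file with zeros up to a chosen byte size; the target size is an assumption you must compute from your own pg_controldata analysis, not a universal value:

```shell
# Pad a file with zero bytes until it reaches target size (no-op if already there).
pad_to() {
    local file="$1" target="$2" cur
    cur=$(wc -c < "$file")
    if [ "$cur" -lt "$target" ]; then
        dd if=/dev/zero bs=1 count=$(( target - cur )) >> "$file" 2>/dev/null
    fi
}
```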
After these fixes, the database started successfully.
Step 3: Identify Corrupted Tables
The database was running, but we needed to check data integrity:
-- List all tables and try to count rows
SELECT schemaname, tablename
FROM pg_tables
WHERE schemaname = 'public';
-- Test each table
SELECT COUNT(*) FROM table_name;
Most tables returned results instantly. But one critical table — containing core user records — would hang indefinitely. The data pages themselves were corrupted.
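Probing dozens of tables by hand is tedious, and a hung SELECT ties up the session. A sketch that wraps each probe in a statement_timeout so a corrupted table fails fast instead of hanging (the 5-second value and the psql connection defaults are assumptions):

```shell
# Build the probe SQL for one table: a timed COUNT that errors out
# instead of hanging when the table's pages are corrupted.
probe_query() {
    printf "SET statement_timeout='5s'; SELECT COUNT(*) FROM \"%s\";" "$1"
}

# Run the probe across every table in the public schema.
probe_all_tables() {
    psql -Atc "SELECT tablename FROM pg_tables WHERE schemaname='public'" |
    while read -r t; do
        if psql -c "$(probe_query "$t")" >/dev/null 2>&1; then
            echo "ok  $t"
        else
            echo "BAD $t"
        fi
    done
}
```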
Step 4: Rebuild the Corrupted Table
We needed the data from that table, but couldn't read it directly. Fortunately, the application synced with an external API that contained the same records.
# Fetch records from external API
curl -X POST "https://external-api/endpoint" \
-H "Authorization: Basic [token]" \
-d '{"query": "all_records"}' \
> recovered_data.json
# Convert to CSV (field list is illustrative; match your table's schema)
jq -r '.records[] | [.id, .name, .status] | @csv' recovered_data.json > recovered_data.csv
Then we rebuilt the table:
-- Create new table with correct schema
CREATE TABLE new_records (LIKE old_records INCLUDING ALL);
-- Import recovered data (run via psql; \copy reads the file client-side,
-- so it doesn't need to exist inside the database container)
\copy new_records FROM '/tmp/recovered_data.csv' WITH CSV
-- Create placeholder records for historical entries
-- (to maintain foreign key integrity)
INSERT INTO new_records (id, name, status, ...)
SELECT DISTINCT referenced_id, 'Archived Record', 'archived', ...
FROM other_table
WHERE referenced_id NOT IN (SELECT id FROM new_records);
-- Swap tables
ALTER TABLE old_records RENAME TO old_records_corrupted;
ALTER TABLE new_records RENAME TO records;
-- Restore foreign key constraints
ALTER TABLE enrollments
ADD CONSTRAINT fk_record
FOREIGN KEY (record_id) REFERENCES records(id);
-- Rebuild indexes
CREATE INDEX idx_records_status ON records(status);
CREATE INDEX idx_records_created ON records(created_at);
Step 5: Verify Integrity
We compared the recovered database against our staging environment:
-- Compare table counts
SELECT
(SELECT COUNT(*) FROM information_schema.tables
WHERE table_schema = 'public') as prod_tables,
-- Compare with staging...
-- Check for orphaned foreign keys
SELECT COUNT(*)
FROM child_table c
LEFT JOIN parent_table p ON c.parent_id = p.id
WHERE p.id IS NULL;
-- Verify index count matches
SELECT COUNT(*) FROM pg_indexes WHERE schemaname = 'public';
Results:
- 45 tables recovered (44 from backup, 1 rebuilt)
- All indexes restored
- All foreign key constraints valid
- Zero orphaned records
- Zero data loss
Prevention: Making Sure This Never Happens Again
Automated Daily Backups
#!/bin/bash
# /etc/cron.daily/pg-backup
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR=/backups/postgresql
CONTAINER=$(docker ps -q -f name=db-service)
# Create compressed backup
docker exec -u postgres $CONTAINER \
pg_dump -Fc database_name > $BACKUP_DIR/backup-$DATE.dump
# Verify backup integrity
pg_restore --list $BACKUP_DIR/backup-$DATE.dump > /dev/null 2>&1
if [ $? -ne 0 ]; then
echo "Backup verification failed!" | mail -s "BACKUP ALERT" admin@company.com
fi
# Retain 30 days
find $BACKUP_DIR -name "*.dump" -mtime +30 -delete
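One caveat with mtime-based cleanup: if the backup job silently stops producing new dumps, find will eventually delete every old one too. A count-based alternative that always keeps the newest N files (a sketch assuming GNU coreutils and date-stamped filenames that sort chronologically):

```shell
# Keep only the newest N dumps, deleting older ones by sorted filename.
keep_newest() {
    local dir="$1" keep="$2"
    ls -1 "$dir"/*.dump 2>/dev/null | sort | head -n -"$keep" | xargs -r rm --
}
```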
Enable WAL Archiving
Added to postgresql.conf:
# Enable continuous archiving
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'
wal_level = replica
# This enables Point-in-Time Recovery (PITR)
# You can restore to any moment, not just the last backup
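One refinement worth making, taken from the archiving examples in the PostgreSQL documentation: have archive_command refuse to overwrite a file that already exists, so a misconfiguration can't silently clobber archived segments:

```
# Safer archive_command: fail rather than overwrite an existing segment
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
```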
Docker Health Checks
# docker-compose.yml or swarm service definition
healthcheck:
test: ["CMD-SHELL", "pg_isready -U dbuser -d dbname"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
Monitoring Alerts
Configured alerts for:
- Connection failures
- Disk space below 20%
- Long-running queries (> 60 seconds)
- Lock wait times (> 30 seconds)
- Replication lag (when replica is added)
Timeline Summary
| Time | Event |
|---|---|
| 09:00 | Database enters crash loop |
| 09:15 | Investigation begins, PANIC errors identified |
| 10:00 | Decision to backup before repair |
| 10:30 | Backup complete (371MB) |
| 11:00 | Local test environment created |
| 12:00 | pg_resetwal fix verified locally |
| 13:00 | Corrupted table identified |
| 14:00 | External API data retrieved |
| 14:45 | Production fix applied |
| 15:00 | Service fully restored |
Total downtime: ~4 hours
Data loss: Zero
Lessons Learned
1. Backups Are Not Optional
"We'll set up backups soon" is how you end up with no backups when you need them. The cost of daily automated backups is trivial compared to the cost of an outage.
2. Test Your Recovery Process
Having backups isn't enough — you need to verify they work. A backup that can't be restored is just wasted disk space.
# Monthly restore test
pg_restore -d test_db backup.dump
psql -d test_db -c "SELECT COUNT(*) FROM critical_table"
3. WAL Archiving Enables Options
Without WAL archiving, your only recovery option is the last backup. With it, you can restore to any point in time. The overhead is minimal; the insurance is invaluable.
4. External Integrations Can Save You
The API integration that saved this recovery wasn't designed for backup purposes. But because core data was synced externally, we had a recovery path. Consider where else your data lives.
5. Corruption Happens
Disk errors, power failures, container crashes — there are many ways data can become corrupted. Plan for it.
Database Recovery Checklist
How prepared are you?
- Automated backups running daily?
- Backup verification (can you actually restore)?
- WAL archiving enabled?
- Replica or standby database?
- Health checks configured?
- Documented recovery procedure?
- Team knows what to do if database won't start?
If you answered "no" to any of these, consider fixing that before you need to.
This case study is based on a real incident. Specific identifiers have been sanitized to protect client confidentiality.