Recovering a Database Everyone Thought Was Lost
A production database crashed hard. No backups. Corrupted files. Thousands of users locked out. Here's how we brought it back with zero data loss.
The Worst Morning Call
"The entire system is down. Nothing works."
That's never the message you want to receive. A SaaS platform serving thousands of daily users was completely offline. The database wouldn't start — it would try, crash, try again, crash again. An endless loop of failure.
This is the story of how we recovered a database that seemed unrecoverable, and the hard lessons that came with it.
Understanding the Damage
The database logs revealed the problem immediately:
PANIC: invalid lp
LOG: startup process (PID xxx) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
This is PostgreSQL aborting during crash recovery. On startup, the startup process replays the Write-Ahead Log (WAL) — the mechanism by which PostgreSQL ensures consistency, logging every change before it is applied. When it hits corruption mid-replay, the database can't determine what state the data should be in, so it refuses to start.
The situation:
- Database in continuous crash loop
- All APIs returning 500 errors
- Thousands of users affected
- Mobile apps completely non-functional
The infrastructure gap:
- No automated backup system
- No streaming replica
- WAL archiving not enabled
- Last manual backup was weeks old
First Rule: Don't Make It Worse
Before touching the production server, we created a complete backup of the corrupted volume:
# Create a backup of the Docker volume
docker run --rm \
-v db-data:/data \
-v /backup:/backup \
alpine tar czf /backup/db-backup-$(date +%Y%m%d).tar.gz -C /data .
# Create a second safety copy
cp /backup/db-backup-*.tar.gz /backup/db-backup-ORIGINAL.tar.gz
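Before experimenting on either copy, it is worth confirming the two archives are byte-identical. A minimal sketch (the paths in the usage comment are the ones created above):

```shell
# Compare checksums of two backup copies; refuse to proceed on mismatch.
verify_copies() {
    local ha hb
    ha=$(sha256sum "$1" | cut -d' ' -f1)
    hb=$(sha256sum "$2" | cut -d' ' -f1)
    if [ "$ha" = "$hb" ]; then
        echo "copies match"
    else
        echo "MISMATCH: do not proceed" >&2
        return 1
    fi
}

# Usage:
# verify_copies /backup/db-backup-20240101.tar.gz /backup/db-backup-ORIGINAL.tar.gz
```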
We then set up an isolated testing environment:
# Extract the safety copy into a scratch data directory
mkdir -p /backup/test-data
tar xzf /backup/db-backup-ORIGINAL.tar.gz -C /backup/test-data
# Create test container with backup data
docker run -d \
--name db-recovery-test \
-v /backup/test-data:/var/lib/postgresql/data \
postgres:15
Every fix we tried, we tried in the test environment first.
The Recovery Process
Step 1: Assess the Damage
First, we needed to understand what was actually corrupted:
# Check PostgreSQL data directory integrity
pg_controldata /var/lib/postgresql/data
# Look for corrupted multixact files
ls -la /var/lib/postgresql/data/pg_multixact/members/
ls -la /var/lib/postgresql/data/pg_multixact/offsets/
The pg_controldata output showed mismatched transaction IDs and corrupted multixact entries — files that track row-level locking across multiple transactions.
Step 2: Reset the Write-Ahead Log
PostgreSQL provides pg_resetwal as a last-resort recovery tool. It essentially tells the database "forget the transaction log and start fresh."
# Stop any running instances
docker service scale db-service=0
# Reset WAL with corrected parameters
# These values must be calculated from pg_controldata output
pg_resetwal -f \
-m [next-multixact-id],[oldest-multixact-id] \
-O [next-oid] \
/var/lib/postgresql/data
Warning: Getting these parameters wrong can cause additional corruption. We calculated them from:
- The pg_controldata output
- Analysis of existing multixact files
- Comparison with a known-good staging database
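Pulling those fields out of the pg_controldata report can be scripted rather than eyeballed. A sketch (the field labels follow pg_controldata's standard report format; verify them against your PostgreSQL version):

```shell
# Extract a named field from pg_controldata output on stdin.
ctl_field() {
    awk -F': *' -v key="$1" '$1 ~ key { gsub(/ +$/, "", $2); print $2; exit }'
}

# Usage:
# pg_controldata /var/lib/postgresql/data | ctl_field "NextMultiXactId"
```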
For the corrupted multixact files, we had to extend them:
# Extend corrupted multixact member file
dd if=/dev/zero bs=4096 count=10 >> pg_multixact/members/0000
# Extend corrupted multixact offset file
dd if=/dev/zero bs=4096 count=5 >> pg_multixact/offsets/0000
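The dd commands above append fixed amounts we derived from the specific files at hand. A more general sketch pads a file with zeros up to a chosen byte size; the target size is an assumption you must compute from your own pg_controldata analysis, not a universal value:

```shell
# Pad a file with zero bytes until it reaches target size (no-op if already there).
pad_to() {
    local file="$1" target="$2" cur
    cur=$(wc -c < "$file")
    if [ "$cur" -lt "$target" ]; then
        dd if=/dev/zero bs=1 count=$(( target - cur )) >> "$file" 2>/dev/null
    fi
}
```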
After these fixes, the database started successfully.
Step 3: Identify Corrupted Tables
The database was running, but we needed to check data integrity:
-- List all tables and try to count rows
SELECT schemaname, tablename
FROM pg_tables
WHERE schemaname = 'public';
-- Test each table
SELECT COUNT(*) FROM table_name;
Most tables returned results instantly. But one critical table — containing core user records — would hang indefinitely. The data pages themselves were corrupted.
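Probing dozens of tables by hand is tedious, and a hung SELECT ties up the session. A sketch that wraps each probe in a statement_timeout so a corrupted table fails fast instead of hanging (the 5-second value and the psql connection defaults are assumptions):

```shell
# Build the probe SQL for one table: a timed COUNT that errors out
# instead of hanging when the table's pages are corrupted.
probe_query() {
    printf "SET statement_timeout='5s'; SELECT COUNT(*) FROM \"%s\";" "$1"
}

# Run the probe across every table in the public schema.
probe_all_tables() {
    psql -Atc "SELECT tablename FROM pg_tables WHERE schemaname='public'" |
    while read -r t; do
        if psql -c "$(probe_query "$t")" >/dev/null 2>&1; then
            echo "ok  $t"
        else
            echo "BAD $t"
        fi
    done
}
```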
Step 4: Rebuild the Corrupted Table
We needed the data from that table, but couldn't read it directly. Fortunately, the application synced with an external API that contained the same records.
# Fetch records from external API
curl -X POST "https://external-api/endpoint" \
-H "Authorization: Basic [token]" \
-d '{"query": "all_records"}' \
> recovered_data.json
# Convert to CSV (field list is illustrative; match your table's schema)
jq -r '.records[] | [.id, .name, .status] | @csv' recovered_data.json > recovered_data.csv
Then we rebuilt the table:
-- Create new table with correct schema
CREATE TABLE new_records (LIKE old_records INCLUDING ALL);
-- Import recovered data (run via psql; \copy reads the file client-side,
-- so it doesn't need to exist inside the database container)
\copy new_records FROM '/tmp/recovered_data.csv' WITH CSV
-- Create placeholder records for historical entries
-- (to maintain foreign key integrity)
INSERT INTO new_records (id, name, status, ...)
SELECT DISTINCT referenced_id, 'Archived Record', 'archived', ...
FROM other_table
WHERE referenced_id NOT IN (SELECT id FROM new_records);
-- Swap tables
ALTER TABLE old_records RENAME TO old_records_corrupted;
ALTER TABLE new_records RENAME TO records;
-- Restore foreign key constraints
ALTER TABLE enrollments
ADD CONSTRAINT fk_record
FOREIGN KEY (record_id) REFERENCES records(id);
-- Rebuild indexes
CREATE INDEX idx_records_status ON records(status);
CREATE INDEX idx_records_created ON records(created_at);
Step 5: Verify Integrity
We compared the recovered database against our staging environment:
-- Compare table counts
SELECT
(SELECT COUNT(*) FROM information_schema.tables
WHERE table_schema = 'public') as prod_tables,
-- Compare with staging...
-- Check for orphaned foreign keys
SELECT COUNT(*)
FROM child_table c
LEFT JOIN parent_table p ON c.parent_id = p.id
WHERE p.id IS NULL;
-- Verify index count matches
SELECT COUNT(*) FROM pg_indexes WHERE schemaname = 'public';
Results:
- 45 tables recovered (44 from backup, 1 rebuilt)
- All indexes restored
- All foreign key constraints valid
- Zero orphaned records
- Zero data loss
Prevention: Making Sure This Never Happens Again
Automated Daily Backups
#!/bin/bash
# /etc/cron.daily/pg-backup
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR=/backups/postgresql
CONTAINER=$(docker ps -q -f name=db-service)
# Create compressed backup
docker exec -u postgres $CONTAINER \
pg_dump -Fc database_name > $BACKUP_DIR/backup-$DATE.dump
# Verify backup integrity
pg_restore --list $BACKUP_DIR/backup-$DATE.dump > /dev/null 2>&1
if [ $? -ne 0 ]; then
echo "Backup verification failed!" | mail -s "BACKUP ALERT" admin@company.com
fi
# Retain 30 days
find $BACKUP_DIR -name "*.dump" -mtime +30 -delete
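One caveat with mtime-based cleanup: if the backup job silently stops producing new dumps, find will eventually delete every old one too. A count-based alternative that always keeps the newest N files (a sketch assuming GNU coreutils and date-stamped filenames that sort chronologically):

```shell
# Keep only the newest N dumps, deleting older ones by sorted filename.
keep_newest() {
    local dir="$1" keep="$2"
    ls -1 "$dir"/*.dump 2>/dev/null | sort | head -n -"$keep" | xargs -r rm --
}
```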
Enable WAL Archiving
Added to postgresql.conf:
# Enable continuous archiving
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'
wal_level = replica
# This enables Point-in-Time Recovery (PITR)
# You can restore to any moment, not just the last backup
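One refinement worth making, taken from the archiving examples in the PostgreSQL documentation: have archive_command refuse to overwrite a file that already exists, so a misconfiguration can't silently clobber archived segments:

```
# Safer archive_command: fail rather than overwrite an existing segment
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
```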
Docker Health Checks
# docker-compose.yml or swarm service definition
healthcheck:
test: ["CMD-SHELL", "pg_isready -U dbuser -d dbname"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
Monitoring Alerts
Configured alerts for:
- Connection failures
- Disk space below 20%
- Long-running queries (> 60 seconds)
- Lock wait times (> 30 seconds)
- Replication lag (when replica is added)
Timeline Summary
| Time | Event |
|---|---|
| 09:00 | Database enters crash loop |
| 09:15 | Investigation begins, PANIC errors identified |
| 10:00 | Decision to backup before repair |
| 10:30 | Backup complete (371MB) |
| 11:00 | Local test environment created |
| 12:00 | pg_resetwal fix verified locally |
| 13:00 | Corrupted table identified |
| 14:00 | External API data retrieved |
| 14:45 | Production fix applied |
| 15:00 | Service fully restored |
Total downtime: ~4 hours
Data loss: Zero
Lessons Learned
1. Backups Are Not Optional
"We'll set up backups soon" is how you end up with no backups when you need them. The cost of daily automated backups is trivial compared to the cost of an outage.
2. Test Your Recovery Process
Having backups isn't enough — you need to verify they work. A backup that can't be restored is just wasted disk space.
# Monthly restore test
pg_restore -d test_db backup.dump
psql -d test_db -c "SELECT COUNT(*) FROM critical_table"
3. WAL Archiving Enables Options
Without WAL archiving, your only recovery option is the last backup. With it, you can restore to any point in time. The overhead is minimal; the insurance is invaluable.
4. External Integrations Can Save You
The API integration that saved this recovery wasn't designed for backup purposes. But because core data was synced externally, we had a recovery path. Consider where else your data lives.
5. Corruption Happens
Disk errors, power failures, container crashes — there are many ways data can become corrupted. Plan for it.
Database Recovery Checklist
How prepared are you?
- Automated backups running daily?
- Backup verification (can you actually restore)?
- WAL archiving enabled?
- Replica or standby database?
- Health checks configured?
- Documented recovery procedure?
- Team knows what to do if database won't start?
If you answered "no" to any of these, consider fixing that before you need to.
This case study is based on a real incident. Specific identifiers have been sanitized to protect client confidentiality.