MySQL incremental backups at scale: From 12-hour backups to 30-minute operations
MySQL
Backups
Database Reliability
Cost Optimization
The Problem
A data-heavy startup was running 2TB of MySQL in production. Full backups took 12+ hours and failed regularly. The team had no reliable way to test recovery because a full restoration took days, so backup failures went undetected until a restore was actually needed. Managed backup solutions (Kasten, AWS DMS) cost $150+/month for backups alone.
The Challenge
Scaling backups at TB size is deceptively hard:
- Full backups are expensive: 12GB+ each, and taking only one per week leaves gaps of up to seven days between restore points
- Incremental backups break easily: chains get orphaned when a base backup is deleted while incrementals still depend on it
- Point-in-time recovery is complex: it requires replaying a full backup plus multiple incrementals in the correct order
- Cost adds up: managed backup solutions charge per GB backed up
- Testing recovery is risky: restoring terabytes of data takes so long that teams avoid testing
The Solution
We built a backup system designed specifically for MySQL at scale:
Phase 1: Backup Architecture
- Implemented XtraBackup for MySQL (understands InnoDB page-level changes)
- Full backups weekly + incremental backups every 6 hours
- Local-first storage (fast SSD for rapid recovery)
- S3 sync for compliance and disaster recovery
- Automatic encryption (AES256) and compression (zstd)
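The full and incremental backups listed above could be driven by a small wrapper that builds the `xtrabackup` command line. The following is a minimal sketch, not the team's actual script: the function name, the directory layout, and the key file path are hypothetical, and `--compress=zstd` assumes a recent XtraBackup 8.0 release that supports zstd.

```python
from typing import List, Optional


def xtrabackup_command(target_dir: str,
                       base_dir: Optional[str] = None,
                       key_file: str = "/etc/mysql/backup.key") -> List[str]:
    """Build an xtrabackup invocation for a full or an incremental backup.

    target_dir, base_dir and key_file are illustrative paths (assumptions).
    """
    cmd = [
        "xtrabackup",
        "--backup",
        f"--target-dir={target_dir}",
        "--compress=zstd",                 # assumes a release with zstd support
        "--encrypt=AES256",
        f"--encrypt-key-file={key_file}",
    ]
    if base_dir is not None:
        # Incremental: copy only the pages changed since the backup in base_dir.
        cmd.append(f"--incremental-basedir={base_dir}")
    return cmd
```

A scheduler would call this once a week without `base_dir` and every 6 hours with `base_dir` set to the previous backup in the chain, then pass the list to `subprocess.run(cmd, check=True)`.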
Phase 2: Chain-Aware Retention
- Built a system to track backup chain dependencies
- Only delete a backup if ALL backups in its chain are older than the retention period
- Prevents orphaned incrementals that can’t be restored
- Clear naming convention for chain tracking
- Automatic cleanup of broken chains
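The retention rule above (delete a chain only when every backup in it has aged out) can be sketched as follows. The `Backup` record and its `chain_id` field are hypothetical stand-ins for whatever the naming convention encodes; the source does not show its actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Backup:
    name: str
    chain_id: str        # hypothetical: identifies the full backup that starts the chain
    taken_at: datetime


def expired_chains(backups: List[Backup], now: datetime,
                   retention: timedelta) -> List[str]:
    """Return the chain ids in which every backup is older than the retention period."""
    newest: Dict[str, datetime] = {}
    for b in backups:
        # Track the most recent backup per chain; it decides whether the chain may go.
        if b.chain_id not in newest or b.taken_at > newest[b.chain_id]:
            newest[b.chain_id] = b.taken_at
    cutoff = now - retention
    return sorted(cid for cid, ts in newest.items() if ts < cutoff)
```

Deciding per chain rather than per backup is what prevents orphans: an old full backup survives as long as any incremental that depends on it is still inside the retention window.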
Phase 3: Point-in-Time Recovery
- Scripted restore process for any specific timestamp
- Applies incremental backups in order, up to the target time
- Auto-detects encryption and compression
- Dry-run support for safety verification
- Restore testing built into weekly procedures
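The ordering requirement can be illustrated with a sketch that turns a chain and a target time into the sequence of `xtrabackup --prepare` steps. It follows the documented XtraBackup pattern of using `--apply-log-only` on every step except the last. The `Chain` type and the directory names are assumptions, the decrypt and decompress steps are omitted, and the result reflects the last backup taken at or before the target rather than an arbitrary instant between backups. Returning the commands instead of running them doubles as a dry run.

```python
from datetime import datetime
from typing import List, Tuple

# (directory, completion time); the first entry is the full backup, the rest
# are its incrementals in the order they were taken. Layout is hypothetical.
Chain = List[Tuple[str, datetime]]


def prepare_commands(chain: Chain, target: datetime) -> List[List[str]]:
    """Return the ordered prepare steps needed to restore up to `target`."""
    if not chain or chain[0][1] > target:
        raise ValueError("target time is earlier than the full backup")
    usable: List[str] = []
    for directory, taken_at in chain:
        if taken_at > target:
            break                      # later incrementals are not applied
        usable.append(directory)
    full = usable[0]
    steps: List[List[str]] = []
    for i, directory in enumerate(usable):
        cmd = ["xtrabackup", "--prepare"]
        if i < len(usable) - 1:
            # Keep the data files ready to accept further incrementals.
            cmd.append("--apply-log-only")
        cmd.append(f"--target-dir={full}")
        if i > 0:
            cmd.append(f"--incremental-dir={directory}")
        steps.append(cmd)
    return steps
```

Once the steps have run, the prepared full backup directory is what gets copied back into the MySQL data directory.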
Phase 4: Operations & Visibility
- Automated weekly analysis of backup chain health
- Storage usage tracking (local vs S3)
- Backup integrity verification
- Simple shell scripts (no operators or controllers needed)
- Clear monitoring and alerting
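One check a weekly chain-health analysis might include is finding incrementals whose chain no longer leads back to a full backup. This is a sketch under assumptions: the `parents` mapping (backup name to the name of the backup it is based on) is a hypothetical representation, not something the source specifies.

```python
from typing import Dict, List, Optional


def orphaned_backups(parents: Dict[str, Optional[str]]) -> List[str]:
    """Return backups that cannot be traced back to a full backup.

    `parents` maps each backup name to the backup it is based on,
    or to None for a full backup.
    """
    orphans: List[str] = []
    for name in parents:
        seen = set()
        current: Optional[str] = name
        while current is not None:
            if current not in parents or current in seen:
                # Missing ancestor (or a cycle): this backup cannot be restored.
                orphans.append(name)
                break
            seen.add(current)
            current = parents[current]
    return sorted(orphans)
```

A non-empty result is the kind of condition worth alerting on, since every backup it lists cannot be restored.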
Results
Reliability:
- ✅ Backup success rate: 99.8% (was 60% with manual process)
- ✅ Recovery tested monthly (previously untested)
- ✅ Point-in-time recovery capability verified
- ✅ No orphaned backup chains in 12+ months
Operational:
- ✅ Backup time: 12 hours → 30 minutes (full) + 5 minutes (incremental)
- ✅ Recovery test cycle: days → hours
- ✅ Backup troubleshooting: 80% less time spent
- ✅ Team confidence in recovery procedures: high
Financial:
- ✅ Cost: $3/month (S3 storage) vs $150+/month for managed solutions
- ✅ Annual savings: ~$1,800
- ✅ ROI from implementation: immediate
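The annual savings figure follows from the two monthly costs quoted above. A quick check, taking $150 as the lower bound of the "$150+/month" managed cost:

```python
managed_monthly = 150   # lower bound from the "$150+/month" figure
s3_monthly = 3          # S3 storage cost quoted above

annual_savings = (managed_monthly - s3_monthly) * 12
print(annual_savings)   # 1764, which rounds to the ~$1,800 quoted
```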
Key Takeaways
- Incremental backups save storage at scale — but only if chain management is automated
- Database-native tools are essential — generic backup solutions don’t understand MySQL’s structure
- Point-in-time recovery requires a full backup plus its incremental chain, but the restore process can be simple
- Recovery testing must be automated — if it’s manual, it won’t happen regularly
- Local-first + cloud sync beats pure cloud backups — faster recovery, lower cost, better compliance