MySQL incremental backups at scale: From 12-hour backups to 30-minute operations

MySQL Backups Database Reliability Cost Optimization

The Problem

A data-heavy startup was running 2TB of MySQL in production. Full backups took 12+ hours and failed regularly. The team had no reliable way to test recovery because a full restoration took days, so backup failures went undetected until a backup was actually needed. Managed backup solutions (Kasten, AWS DMS) cost $150+/month for backups alone.

The Challenge

Scaling backups at TB size is deceptively hard:

  • Full backups are expensive: 12GB+ each, and running only one per week leaves long gaps between restore points
  • Incremental backups break easily: when a full backup is deleted while incrementals still depend on it, the rest of the chain is orphaned
  • Point-in-time recovery is complex: it requires replaying a full backup plus multiple incrementals in the correct order
  • Cost adds up: managed backup solutions charge per GB backed up
  • Testing recovery is risky: restoring terabytes of data takes so long that the team avoids testing

The Solution

We built a backup system designed specifically for MySQL at scale:

Phase 1: Backup Architecture

  • Adopted Percona XtraBackup, which tracks InnoDB page-level changes and so supports true incremental backups
  • Full backups weekly + incremental backups every 6 hours
  • Local-first storage (fast SSD for rapid recovery)
  • S3 sync for compliance and disaster recovery
  • Automatic encryption (AES256) and compression (zstd)
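
The schedule above can be sketched as a small decision function. This is an illustrative sketch in Python, not the production code (the real system used shell scripts). The function name, directory paths, and key file are hypothetical. The xtrabackup flags shown exist in recent 8.0 releases, but should be checked against the installed version (zstd compression in particular needs a newer build).

```python
from datetime import datetime, timedelta

# Weekly full backups, per the schedule described above.
FULL_INTERVAL = timedelta(days=7)


def plan_backup(now, last_full_at, last_backup_dir, target_dir, key_file):
    """Return the xtrabackup argv for the next run (full or incremental).

    All arguments are hypothetical inputs a scheduler would supply:
    last_full_at is the time of the most recent full backup (or None),
    last_backup_dir is the directory of the most recent backup in the chain.
    """
    cmd = [
        "xtrabackup", "--backup",
        f"--target-dir={target_dir}",
        "--compress=zstd",                  # compression, as in the list above
        "--encrypt=AES256",                 # encryption, as in the list above
        f"--encrypt-key-file={key_file}",
    ]
    needs_full = last_full_at is None or now - last_full_at >= FULL_INTERVAL
    if not needs_full:
        # An incremental copies only pages changed since the previous backup.
        cmd.append(f"--incremental-basedir={last_backup_dir}")
    return cmd
```

A scheduler (cron, for example) would call this every 6 hours and execute the returned command, so a full backup is taken whenever the last one is a week old and an incremental otherwise.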

Phase 2: Chain-Aware Retention

  • Built a system to track backup chain dependencies
  • Only delete a backup if ALL backups in its chain are older than the retention period
  • Prevents orphaned incrementals that can’t be restored
  • Clear naming convention for chain tracking
  • Automatic cleanup of broken chains
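
The retention rule above amounts to expiring whole chains rather than individual backups. A minimal sketch in Python, assuming each backup records which chain it belongs to (the `Backup` type, its fields, and the naming shown are hypothetical, not the original implementation):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Backup:
    name: str            # hypothetical name, e.g. "chain-20240101/inc-02"
    chain_id: str        # identifies the full backup this one depends on
    taken_at: datetime


def expired_chains(backups, now, retention):
    """Return ids of chains in which EVERY backup is older than retention."""
    newest = {}
    for b in backups:
        # Track the most recent backup time seen for each chain.
        if b.chain_id not in newest or b.taken_at > newest[b.chain_id]:
            newest[b.chain_id] = b.taken_at
    cutoff = now - retention
    # A chain is deletable only if even its newest member is past the cutoff.
    return sorted(cid for cid, ts in newest.items() if ts < cutoff)
```

Because the decision is made per chain, an old full backup is never removed while a newer incremental still depends on it.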

Phase 3: Point-in-Time Recovery

  • Scripted restore process for any specific timestamp
  • Replays incremental backups in order, up to the target time
  • Auto-detects encryption and compression
  • Dry-run support for safety verification
  • Restore testing built into weekly procedures
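
The replay order can be illustrated with a short Python sketch that builds the `xtrabackup --prepare` sequence for a target time and supports a dry run. The function names and directory layout are hypothetical. The sketch assumes the backups have already been decrypted and decompressed, leaves out the final copy-back into the data directory, and restores to the latest backup taken at or before the target rather than to an exact second.

```python
import subprocess
from datetime import datetime


def restore_plan(full_dir, incrementals, target_time):
    """Build the prepare commands needed to reach target_time.

    incrementals: list of (taken_at, directory) tuples from the same chain
    as full_dir. The caller is assumed to have picked the right chain.
    """
    chosen = [d for ts, d in sorted(incrementals) if ts <= target_time]
    prepare = ["xtrabackup", "--prepare"]
    target = f"--target-dir={full_dir}"
    if not chosen:
        # No incrementals apply: prepare the full backup on its own.
        return [prepare + [target]]
    # The base and every incremental except the last use --apply-log-only,
    # so uncommitted transactions are not rolled back too early.
    steps = [prepare + ["--apply-log-only", target]]
    for inc_dir in chosen[:-1]:
        steps.append(prepare + ["--apply-log-only", target,
                                f"--incremental-dir={inc_dir}"])
    steps.append(prepare + [target, f"--incremental-dir={chosen[-1]}"])
    return steps


def run_restore(steps, dry_run=True):
    """Print the commands in dry-run mode, otherwise execute them in order."""
    for argv in steps:
        if dry_run:
            print("DRY RUN:", " ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Keeping plan construction separate from execution is what makes the dry run cheap: the same plan can be printed for review before anything touches the backup directories.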

Phase 4: Operations & Visibility

  • Automated weekly analysis of backup chain health
  • Storage usage tracking (local vs S3)
  • Backup integrity verification
  • Simple shell scripts (no operators or controllers needed)
  • Clear monitoring and alerting
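
One part of the chain health analysis, detecting incrementals whose ancestry no longer reaches a full backup, could look like the following Python sketch. It assumes a manifest that maps each backup name to its parent; that manifest format and the function name are hypothetical, not the original scripts.

```python
def find_orphans(manifest):
    """Return backups whose parent chain does not end at a present full backup.

    manifest: dict mapping backup name -> parent name, with None as the
    parent of a full backup (a hypothetical format for illustration).
    """
    orphans = []
    for name in manifest:
        seen = set()
        cur = name
        while cur is not None:
            # A missing ancestor or a cycle both mean the chain is broken.
            if cur not in manifest or cur in seen:
                orphans.append(name)
                break
            seen.add(cur)
            cur = manifest[cur]
    return sorted(orphans)
```

A weekly job could run this over the manifest and raise an alert whenever the returned list is non-empty.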

Results

Reliability:

  • ✅ Backup success rate: 99.8% (was 60% with manual process)
  • ✅ Recovery tested monthly (previously untested)
  • ✅ Point-in-time recovery capability verified
  • ✅ No orphaned backup chains in 12+ months

Operational:

  • ✅ Backup time: 12 hours → 30 minutes (full) + 5 minutes (incremental)
  • ✅ Recovery test cycle: days → hours
  • ✅ Backup troubleshooting: 80% less time spent
  • ✅ Team confidence in recovery procedures: high

Financial:

  • ✅ Cost: $3/month (S3 storage) vs $150+/month for managed solutions
  • ✅ Annual savings: ~$1,800 (at least $147/month × 12 = $1,764)
  • ✅ ROI from implementation: immediate

Key Takeaways

  1. Incremental backups save storage at scale — but only if chain management is automated
  2. Database-native tools are essential — generic backup solutions don’t understand MySQL’s structure
  3. Point-in-time recovery requires a full backup plus its incremental chain, but the restore process can still be simple
  4. Recovery testing must be automated — if it’s manual, it won’t happen regularly
  5. Local-first + cloud sync beats pure cloud backups — faster recovery, lower cost, better compliance
