Zero-downtime MySQL migration at TB scale

MySQL Kubernetes Migration

The Problem

A fintech SaaS company was running a 2TB MySQL database on aging bare metal servers that were nearing end-of-life. Backups were manual and untested, and there was no failover: if the primary went down, the platform went down. The migration had to happen soon, but downtime was unacceptable.

The Challenge

This was more than a data move. The company processes billions of transactions annually, and every second of downtime costs money. A traditional big-bang migration, which takes the database offline while the data is copied, wasn't an option.

The Solution

We designed and executed a zero-downtime migration using a staged approach:

Phase 1: Replica Setup

  • Built a Percona XtraDB Cluster (PXC) on Kubernetes with 3 nodes
  • Set up asynchronous MySQL replication from the old bare metal primary to the new PXC cluster
  • Validated that replication lag stayed consistently below 100 ms
  • Ran queries against the replica to verify correctness
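
One way to verify correctness on the replica is to compare chunked checksums of the same table on both sides. The sketch below is a minimal illustration in Python, not the procedure used in this project: the function names, the chunk size, and the assumption that rows arrive in primary-key order are all hypothetical. Tools such as pt-table-checksum apply the same idea inside MySQL itself.

```python
import hashlib
from typing import Iterable, List, Sequence


def chunk_checksums(rows: Iterable[Sequence], chunk_size: int = 1000) -> List[str]:
    """Return one SHA-256 digest per chunk of `chunk_size` rows.

    Rows are assumed to be supplied in primary-key order on both sides,
    so that chunk N covers the same key range on source and replica.
    """
    digests: List[str] = []
    hasher = hashlib.sha256()
    count = 0
    for row in rows:
        hasher.update(repr(tuple(row)).encode("utf-8"))
        count += 1
        if count % chunk_size == 0:
            digests.append(hasher.hexdigest())
            hasher = hashlib.sha256()
    if count % chunk_size != 0:
        # Final partial chunk.
        digests.append(hasher.hexdigest())
    return digests


def mismatched_chunks(source_rows: Iterable[Sequence],
                      replica_rows: Iterable[Sequence],
                      chunk_size: int = 1000) -> List[int]:
    """Return the indexes of chunks whose digests differ between the two sides."""
    src = chunk_checksums(source_rows, chunk_size)
    rep = chunk_checksums(replica_rows, chunk_size)
    bad: List[int] = []
    for i in range(max(len(src), len(rep))):
        a = src[i] if i < len(src) else None
        b = rep[i] if i < len(rep) else None
        if a != b:
            bad.append(i)
    return bad
```

An empty result means every chunk matched. A non-empty result narrows the problem to specific key ranges, which can then be compared row by row. On a database under live writes, the two sides must be read at the same replication position for the comparison to be meaningful.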

Phase 2: Connection Routing

  • Deployed ProxySQL in front of both clusters
  • Wrote traffic routing rules to send reads to the new cluster
  • Kept writes going to the old primary while replication caught up
  • Monitored replication lag continuously
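
The read/write split in this phase can be modelled in a few lines of Python. This is a simplified sketch, not the production rule set: the hostgroup IDs and the two regex rules are hypothetical, and real ProxySQL matches `match_digest` against the normalized query digest rather than the raw query text. The `mysql_query_rules` columns used in the generated SQL do exist in the ProxySQL admin interface.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical hostgroup IDs; the real values are not given in the case study.
OLD_PRIMARY_HG = 10  # writes stay on the old bare metal primary
NEW_PXC_HG = 20      # reads go to the new PXC cluster


@dataclass(frozen=True)
class QueryRule:
    rule_id: int
    match_digest: str           # regex applied to the query text
    destination_hostgroup: int


# Locking reads must follow the writes, so that rule is evaluated first.
RULES: List[QueryRule] = [
    QueryRule(1, r"^SELECT.*FOR UPDATE", OLD_PRIMARY_HG),
    QueryRule(2, r"^SELECT", NEW_PXC_HG),
]


def route(query: str, rules: List[QueryRule] = RULES,
          default_hostgroup: int = OLD_PRIMARY_HG) -> int:
    """Return the hostgroup for a query: first matching rule in rule_id order wins."""
    text = query.strip()
    for rule in sorted(rules, key=lambda r: r.rule_id):
        if re.search(rule.match_digest, text, re.IGNORECASE | re.DOTALL):
            return rule.destination_hostgroup
    return default_hostgroup


def to_admin_sql(rule: QueryRule) -> str:
    """Render a rule as an INSERT for the ProxySQL admin interface."""
    return (
        "INSERT INTO mysql_query_rules "
        "(rule_id, active, match_digest, destination_hostgroup, apply) "
        f"VALUES ({rule.rule_id}, 1, '{rule.match_digest}', "
        f"{rule.destination_hostgroup}, 1);"
    )
```

For example, `route("SELECT * FROM accounts WHERE id = 1")` returns the reader hostgroup, while any `UPDATE` or `SELECT ... FOR UPDATE` falls through to the old primary. In ProxySQL itself, inserted rules take effect only after `LOAD MYSQL QUERY RULES TO RUNTIME`.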

Phase 3: Cutover

  • Once replication lag was zero for 24 hours, we executed the cutover
  • Redirected write traffic from old primary to new PXC cluster (5-second switch)
  • Kept old primary as read replica for immediate rollback
  • Monitored for 2 hours without issues, then decommissioned old hardware
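
The cutover gate, zero lag for 24 hours, can be expressed as a check over timestamped lag samples. The following Python sketch is an assumption about how such a gate might look, not the tooling used here: the sample format, the `max_gap` coverage rule, and the function name are hypothetical. The coverage rule exists so that a monitoring outage is not mistaken for 24 quiet hours.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def ready_for_cutover(
    samples: Iterable[Tuple[datetime, float]],
    now: datetime,
    window: timedelta = timedelta(hours=24),
    max_gap: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True if lag was zero for the whole trailing window.

    `samples` holds (timestamp, lag_in_seconds) pairs. The check fails if
    any sample in the window shows lag, or if the samples leave a gap
    longer than `max_gap` anywhere in the window.
    """
    start = now - window
    recent = sorted((ts, lag) for ts, lag in samples if start <= ts <= now)
    if not recent:
        return False
    if any(lag > 0 for _, lag in recent):
        return False
    # Window start, every sample time, and "now" must be closely spaced.
    checkpoints = [start] + [ts for ts, _ in recent] + [now]
    return all(
        later - earlier <= max_gap
        for earlier, later in zip(checkpoints, checkpoints[1:])
    )
```

A gate like this would run immediately before the write switch. If it returns False, the cutover is postponed and the window starts over.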

Phase 4: Backups

  • Configured automated XtraBackup backups to S3 (daily full backups plus incrementals)
  • Set up backup verification (daily restore tests to isolated instance)
  • Documented recovery procedures
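
To make the backup and restore-test steps concrete, here is a small Python sketch that builds the `xtrabackup` command lines for a full backup, an incremental backup, and the prepare sequence a restore test would run. The directory paths and function names are hypothetical, and the upload to S3 is left out. The flags shown (`--backup`, `--prepare`, `--target-dir`, `--incremental-basedir`, `--incremental-dir`, `--apply-log-only`) are standard XtraBackup options, but this is not necessarily how the backups were wired up in this project.

```python
from typing import List, Optional, Sequence


def backup_cmd(target_dir: str, incremental_basedir: Optional[str] = None) -> List[str]:
    """Build the argv for a full backup, or an incremental one if a base is given."""
    cmd = ["xtrabackup", "--backup", f"--target-dir={target_dir}"]
    if incremental_basedir is not None:
        cmd.append(f"--incremental-basedir={incremental_basedir}")
    return cmd


def prepare_cmds(full_dir: str, incremental_dirs: Sequence[str]) -> List[List[str]]:
    """Build the prepare sequence for restoring a full backup plus incrementals.

    Every step except the last uses --apply-log-only, so that uncommitted
    transactions are not rolled back before later incrementals are applied.
    """
    if not incremental_dirs:
        return [["xtrabackup", "--prepare", f"--target-dir={full_dir}"]]

    cmds = [["xtrabackup", "--prepare", "--apply-log-only", f"--target-dir={full_dir}"]]
    for inc in incremental_dirs[:-1]:
        cmds.append([
            "xtrabackup", "--prepare", "--apply-log-only",
            f"--target-dir={full_dir}", f"--incremental-dir={inc}",
        ])
    # Last incremental: no --apply-log-only, which finishes the prepare.
    cmds.append([
        "xtrabackup", "--prepare",
        f"--target-dir={full_dir}", f"--incremental-dir={incremental_dirs[-1]}",
    ])
    return cmds
```

A daily restore test would run the prepare sequence against a copy of the latest backup chain on the isolated instance, start MySQL on the result, and run sanity queries. Each argv list can be passed to `subprocess.run(cmd, check=True)` so that a failed step stops the test.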

Technologies Used

  • MySQL 8.0 (InnoDB)
  • Percona XtraDB Cluster 8.0
  • ProxySQL 2.x (connection pooling and routing)
  • XtraBackup (backup automation)
  • Amazon S3 (backup storage)
  • Kubernetes (PXC cluster management)

Results

  • Zero downtime during migration cutover
  • 2TB database successfully migrated and operational
  • Automated backups running daily with validation
  • RTO < 5 minutes for any single-node failure (cluster recovers automatically)
  • RPO < 1 hour (daily full backups combined with incremental backups)
  • Team confidence: the database platform is now understood and operable

The company decommissioned its 20-year-old bare metal hardware and moved to a cloud-native database platform without any production disruption.
