Zero-downtime MySQL migration at TB scale

The Problem

A fintech SaaS company was running a 2TB MySQL database on aging bare metal servers that were nearing end-of-life. Backups were manual and untested. There was no failover: if the primary went down, the platform went down. The migration had to happen soon, but downtime during it was unacceptable.

The Challenge

This wasn’t just moving data. The company processes billions of transactions annually. Every second of downtime costs money. A traditional big-bang migration wasn’t an option.

The Solution

We designed and executed a zero-downtime migration using a staged approach:

Phase 1: Replica Setup

Built a Percona XtraDB Cluster (PXC) on Kubernetes with 3 nodes
Set up MySQL replication from the old bare metal primary to the new PXC cluster
Validated that replication lag stayed consistently below 100 ms
Ran queries against the replica to verify correctness
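
The two validation steps above can be sketched in Python. This is a minimal illustration, not the tooling used in the project: it assumes lag samples (in milliseconds) and per-table checksums have already been collected from both sides by some other means, and the function names are hypothetical.

```python
from typing import Iterable, List, Mapping

LAG_THRESHOLD_MS = 100  # the case study reports lag consistently below 100 ms


def lag_within_threshold(samples_ms: Iterable[float],
                         threshold_ms: float = LAG_THRESHOLD_MS) -> bool:
    """True only if at least one sample exists and every sample is below the threshold."""
    samples = list(samples_ms)
    return bool(samples) and all(s < threshold_ms for s in samples)


def mismatched_tables(source: Mapping[str, int],
                      replica: Mapping[str, int]) -> List[str]:
    """Sorted table names whose checksum differs or that exist on only one side."""
    all_tables = set(source) | set(replica)
    return sorted(t for t in all_tables if source.get(t) != replica.get(t))
```

An empty sample list is treated as a failure, so a broken collector cannot pass the lag check silently. An empty result from `mismatched_tables` is the signal that the replica matches the source for the tables that were checked.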

Phase 2: Connection Routing

Deployed ProxySQL in front of both clusters
Wrote traffic routing rules to send reads to the new cluster
Kept writes going to the old primary while replication caught up
Monitored replication lag continuously
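
As a rough sketch of the routing described above, the Python below generates statements for the ProxySQL admin interface that send reads to the new cluster and leave everything else on the old primary. The hostgroup IDs, host names, and function names are assumptions made for illustration; the actual rules used in the migration are not given in this case study.

```python
from typing import List

WRITER_HG = 10  # hypothetical hostgroup ID: old bare metal primary (writes)
READER_HG = 20  # hypothetical hostgroup ID: new PXC nodes (reads)


def server_statements(old_primary: str, pxc_nodes: List[str],
                      port: int = 3306) -> List[str]:
    """Register the old primary and the PXC nodes in their hostgroups."""
    hosts = [(WRITER_HG, old_primary)] + [(READER_HG, node) for node in pxc_nodes]
    stmts = [
        "INSERT INTO mysql_servers (hostgroup_id, hostname, port) "
        f"VALUES ({hg}, '{host}', {port});"
        for hg, host in hosts
    ]
    return stmts + ["LOAD MYSQL SERVERS TO RUNTIME;", "SAVE MYSQL SERVERS TO DISK;"]


def read_split_statements() -> List[str]:
    """Route plain SELECTs to the readers; locking SELECTs stay with the writer."""
    rule = (
        "INSERT INTO mysql_query_rules "
        "(rule_id, active, match_digest, destination_hostgroup, apply) "
        "VALUES ({rule_id}, 1, '{pattern}', {hg}, 1);"
    )
    return [
        # SELECT ... FOR UPDATE takes locks, so it must run on the writer.
        rule.format(rule_id=1, pattern="^SELECT.*FOR UPDATE$", hg=WRITER_HG),
        rule.format(rule_id=2, pattern="^SELECT", hg=READER_HG),
        "LOAD MYSQL QUERY RULES TO RUNTIME;",
        "SAVE MYSQL QUERY RULES TO DISK;",
    ]
```

Queries that match neither rule go to the default hostgroup of the connecting user, so this sketch assumes application users default to the writer hostgroup. With this layout, cutover amounts to moving the writer hostgroup to the new cluster.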

Phase 3: Cutover

Once replication lag was zero for 24 hours, we executed the cutover
Redirected write traffic from old primary to new PXC cluster (5-second switch)
Kept old primary as read replica for immediate rollback
Monitored for 2 hours without issues, then decommissioned old hardware
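
The cutover precondition (zero lag sustained for 24 hours) lends itself to an automated gate. The following is a minimal sketch under stated assumptions: lag samples arrive as `(timestamp, lag_seconds)` pairs from whatever monitoring is in place, and the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]  # (when the sample was taken, lag in seconds)


def ready_for_cutover(samples: List[Sample],
                      now: datetime,
                      window: timedelta = timedelta(hours=24),
                      max_lag_s: float = 0.0) -> bool:
    """True if monitoring covers the whole window and no sample in it exceeds max_lag_s."""
    if not samples:
        return False
    window_start = now - window
    # Require evidence reaching back to the start of the window; otherwise
    # a freshly started monitor would pass after a single clean sample.
    if min(ts for ts, _ in samples) > window_start:
        return False
    return all(lag <= max_lag_s for ts, lag in samples if ts >= window_start)
```

The gate ignores spikes older than the window and fails if any sample inside the window exceeds the limit. It does not detect gaps in sampling, which a production version would likely also want to check.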

Phase 4: Backups

Configured automated XtraBackup backups to S3 (daily full backups plus incrementals)
Set up backup verification (daily restore tests to isolated instance)
Documented recovery procedures
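
To illustrate the backup and restore-test flow, the sketch below builds the `xtrabackup` command lines for a full backup, an incremental backup, and the prepare sequence a restore test would run. The directory paths and function names are assumptions, the upload to S3 is not shown, and the commands are only constructed here, not executed.

```python
from typing import List, Optional


def backup_command(target_dir: str, base_dir: Optional[str] = None) -> List[str]:
    """Full backup when base_dir is None, otherwise an incremental on top of base_dir."""
    cmd = ["xtrabackup", "--backup", f"--target-dir={target_dir}"]
    if base_dir is not None:
        cmd.append(f"--incremental-basedir={base_dir}")
    return cmd


def prepare_commands(full_dir: str, incremental_dirs: List[str]) -> List[List[str]]:
    """Commands that make a full backup plus its incrementals ready to restore."""
    if not incremental_dirs:
        return [["xtrabackup", "--prepare", f"--target-dir={full_dir}"]]
    # Every step except the last uses --apply-log-only so that uncommitted
    # transactions are not rolled back before later increments are applied.
    cmds = [["xtrabackup", "--prepare", "--apply-log-only", f"--target-dir={full_dir}"]]
    last = len(incremental_dirs) - 1
    for i, inc_dir in enumerate(incremental_dirs):
        cmd = ["xtrabackup", "--prepare"]
        if i != last:
            cmd.append("--apply-log-only")
        cmd += [f"--target-dir={full_dir}", f"--incremental-dir={inc_dir}"]
        cmds.append(cmd)
    return cmds
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a host where `xtrabackup` is installed. A restore test would then start an isolated MySQL instance on the prepared directory and run sanity queries against it.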

Technologies Used

MySQL 8.0 (InnoDB)
Percona XtraDB Cluster 8.0
ProxySQL 2.x (connection pooling and routing)
XtraBackup (backup automation)
Amazon S3 (backup storage)
Kubernetes (PXC cluster management)

Results

Zero downtime during migration cutover
2TB database successfully migrated and operational
Automated backups running daily with validation
RTO < 5 minutes for any single-node failure (cluster recovers automatically)
RPO < 1 hour (daily full backups plus incremental backups)
Team confidence: the database platform is now understood and operable

The company was able to decommission 20-year-old bare metal hardware and move to a cloud-native database platform without any production disruption.
