Migrating 1.3TB of Primary Data from Self-Hosted MySQL to GCP CloudSQL

Context.

Flip is a fintech company in Indonesia. Their primary MySQL database was self-managed and self-hosted on another cloud provider. They wanted to move away from that provider to GCP and migrate the self-hosted MySQL setup to GCP's CloudSQL. The MySQL server received peak traffic of 10,000 database requests per second, and the total database size was 1.3 TB.

Problem Statement.

Migrate from a self-hosted MySQL cluster of three nodes in a primary/replica setup to GCP CloudSQL. Because the data belongs to a fintech company, downtime had to be minimal without sacrificing data consistency, integrity, durability, or adherence to the organization's SLAs.

Outcome/Impact.


Seamlessly migrated all application traffic with less than 10 minutes of downtime for the database switchover.

Data consistency remained intact.

Set up backup and restore processes.

Handled the peak traffic with lower hardware utilization, thereby saving on cloud costs.

Solution.

The diagram below shows the initial setup. The applications ran as Pods, which pointed to a ProxySQL high-availability cluster. This cluster routed all write queries to the MySQL primary and read queries to the MySQL read replicas, which stayed in sync via binary log replication. The SRE team managed each MySQL server node.

Existing self-hosted MySQL setup
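For reference, a read/write split like this is usually expressed in ProxySQL as backend hostgroups plus query rules. The sketch below is illustrative only; the hostnames, hostgroup IDs, weights, and rules are hypothetical, not the actual production configuration.

    -- Illustrative ProxySQL admin statements (admin interface, port 6032);
    -- hostgroup 10 = writer, hostgroup 20 = readers.
    INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
    VALUES (10, 'mysql-primary.internal', 3306, 1000),
           (20, 'mysql-replica-1.internal', 3306, 1000),
           (20, 'mysql-replica-2.internal', 3306, 1000);

    -- Route SELECT ... FOR UPDATE to the writer, plain SELECTs to the readers.
    INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
    VALUES (1, 1, '^SELECT.*FOR UPDATE', 10, 1),
           (2, 1, '^SELECT', 20, 1);

    LOAD MYSQL SERVERS TO RUNTIME;
    SAVE MYSQL SERVERS TO DISK;
    LOAD MYSQL QUERY RULES TO RUNTIME;
    SAVE MYSQL QUERY RULES TO DISK;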

We evaluated different options for migrating data from MySQL to CloudSQL. These included:

  • Point-in-time backup and restore
  • GCP Data Migration Service (DMS) with support for continuous replication
  • CloudSQL External Replication 

After comparing these options, we found DMS the easiest to set up: it uses native binary logs for replication and comes with built-in monitoring of replication lag and the amount of data copied.

We also found a few downsides with DMS:

  • DMS provisions the CloudSQL instance itself, so we could not create it with Terraform; we could only import it into Terraform later.
  • The CloudSQL primary instance DMS provisions runs in replica mode and is not HA. Reads and writes are blocked while it is promoted to primary, so we planned a small downtime window around the promotion.
  • DMS could not reach the source MySQL primary in the other cloud provider even though we had set up VPC peering. We had to add iptables NAT rules to work around it.
  • DMS runs large SELECT * … statements over entire tables, which blocks writes and hurts database performance on huge tables. To work around this, we used a read-only replica with binary logs enabled as the source for DMS (see the sketch after this list).
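
The sketch below shows the kind of checks involved in using that read replica as the DMS source. The exact prerequisites depend on the DMS and MySQL versions, so treat it as an illustration rather than the definitive checklist.

    -- Run on the read replica acting as the DMS source (illustrative):
    SHOW VARIABLES LIKE 'log_bin';            -- binary logging must be ON
    SHOW VARIABLES LIKE 'log_slave_updates';  -- the replica must write replicated changes to its own binlog
    SHOW MASTER STATUS;                       -- confirms a binlog file/position is being produced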

We started DMS a couple of days before D-Day. It took around three days to copy the 1.3 TB of data, after which the DMS replication lag dropped toward zero. At that point, the CloudSQL replica was in sync with the MySQL primary.
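In addition to the lag DMS reports, replication can be sanity-checked directly on the destination over a plain MySQL connection. The following is a generic illustration, not a record of the exact checks run during the migration, and assumes the instance still accepts the older SHOW SLAVE STATUS syntax (newer MySQL versions use SHOW REPLICA STATUS).

    -- Illustrative direct check on the CloudSQL destination:
    SHOW SLAVE STATUS\G
    -- Slave_IO_Running and Slave_SQL_Running should both be 'Yes',
    -- and Seconds_Behind_Master should trend to 0 before scheduling the cutover.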

For the cutover:

  • We put all customer-facing apps into maintenance mode so users would see a clear, easy-to-understand response.
  • We were already using ProxySQL for read/write workload separation, so we used it to our advantage to simplify the cutover. We reduced the write weights to ensure internal apps were no longer writing to the old primary, added a backend host entry for CloudSQL, and adjusted the weights during the cutover to move traffic from the self-hosted MySQL to CloudSQL (the sketch after this list shows the kind of weight changes involved).
  • Once basic reads and writes were working and tested with internal apps, we provisioned more CloudSQL replicas to match our original cluster size and updated ProxySQL with the new read replica entries.
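
The weight changes referenced above look roughly like the following on the ProxySQL admin interface. Hostnames, IPs, hostgroup IDs, and weights are placeholders; the actual values were specific to the production cluster.

    -- Illustrative cutover sequence on the ProxySQL admin interface (port 6032)

    -- 1. Drain writes from the old primary by dropping its weight.
    UPDATE mysql_servers SET weight = 0
    WHERE hostgroup_id = 10 AND hostname = 'mysql-primary.internal';

    -- 2. Add the promoted CloudSQL primary as the new writer backend.
    INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
    VALUES (10, '10.0.0.10' /* CloudSQL primary private IP */, 3306, 1000);

    -- 3. After validation, register the new CloudSQL read replicas in the reader hostgroup.
    INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
    VALUES (20, '10.0.0.11' /* CloudSQL replica private IP */, 3306, 1000);

    LOAD MYSQL SERVERS TO RUNTIME;
    SAVE MYSQL SERVERS TO DISK;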

At a high level, the cutover process looked like this:

Cutover process

After production traffic was being served successfully, we observed frequent tiny spikes in replication lag. We tuned the innodb_flush_log_at_trx_commit flag. CloudSQL also supports parallel replication, but the flag change worked well enough that we decided to defer enabling parallel replication.
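For context, this flag controls how often InnoDB flushes the redo log. The sketch below only shows how to inspect it and what the values mean; it does not reflect the specific value chosen in production, and on CloudSQL the value is changed through the instance's database flags rather than SET GLOBAL.

    -- Inspect the current redo-log flush behaviour (illustrative):
    SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
    -- 1 = write and sync the redo log at every commit (most durable, most I/O)
    -- 2 = write at commit, sync roughly once per second (less I/O, small loss window on crash)
    -- 0 = write and sync roughly once per second (least durable)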

We also observed that the GCP Cloud Monitoring metrics did not reflect the true values of metrics such as replication lag, so we set up a native MySQL agent (Datadog in our case) that connects to CloudSQL over a MySQL connection to collect more accurate values, which we then used for alerting.
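An agent like this typically needs a MySQL account with enough privileges to read replication and process status. A minimal sketch, assuming a dedicated monitoring user; the user name and host pattern are placeholders, and the real integration may require additional grants.

    -- Illustrative monitoring account for an agent that reads replication status over MySQL:
    CREATE USER 'monitoring'@'%' IDENTIFIED BY '<strong-password>';
    GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'monitoring'@'%';
    -- REPLICATION CLIENT permits SHOW SLAVE/REPLICA STATUS, which is where the
    -- true replication lag figure comes from.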

Once the cutover was complete, the final setup looked like the diagram below: the application Pods connect to the ProxySQL HA setup, which routes write queries to the CloudSQL primary and read queries to the CloudSQL read replicas.

New GCP CloudSQL setup

The entire migration took two months for a team of four engineers, including planning for both MySQL and the other stateless workloads. We migrated the stateless services without any downtime. MySQL was the most challenging part of the journey, and we completed its migration with a mere 10 minutes of downtime.

Tech stack used.

Google Cloud (CloudSQL), MySQL, ProxySQL, Datadog