Zero Downtime Table Migrations using a Double Write Methodology

A database table’s lifecycle is constantly becoming more and more complex. It grows beyond its original territory, taking on new responsibilities and interacting with other parts of the code base in unpredictable ways. In order to keep the code ecosystem flourishing, it’s important to periodically refactor tables with too many responsibilities to make it easier to develop new features and fix bugs.

In the health benefits world at Gusto, we found ourselves facing a giant table that we needed to break up in order to complete product features. In this post, we detail the steps we took to make this a zero downtime migration.

Background

Our initial health benefits system had two tables to store employee benefits data: subscriptions and waivers. The subscriptions table began to outgrow its initial purpose, and it also shared functionality with the waivers table. Other details about these tables are not important to know for the purposes of this post; our focus will be on the data migration to our new table - selections. The new selections table is a combination of a part of the subscriptions table (unprocessed subscriptions) and waivers. Adding the selections table allowed us to remove the waivers table and delete unprocessed subscriptions from the subscriptions table. We’ll go into more detail about the table mappings later.

In order to do the migration to the selections table with zero downtime, we followed a standard procedure that allowed us to confidently migrate the data and start using selections in place of unprocessed subscriptions and waivers. Examples of this procedure can be found in this Stripe blog post or this post from a Paypal data engineer. We’ll discuss how we followed this approach and what we learned throughout the process.

There are three major steps:

  1. The double-write: Write data to both the old and new tables
  2. The single-read: Instead of reading the (identical) data from the old table, read it from the new table.
  3. The single-write: Once we are exclusively reading from the new table, we can stop writing to the old table and delete the old, no longer read data.

The Double Write

The first phase of our project duplicates the data into our new selections table and ensures that the selections table and the unprocessed subscriptions and waivers tables stay in sync. This allows us to confidently use selections in place of unprocessed subscriptions and waivers.

Double writing transforms one set of data to a new set of data using a mapping function. The simplest mapping function, in which one object in the source table corresponds to one object in the target table, is shown below:

Screen-Shot-2018-08-13-at-1.36.41-PM

In our case, the initial objects are unprocessed subscriptions and waivers and the new objects are selections. Our mapping function looks like this:

Screen-Shot-2018-08-01-at-2.51.16-PM

Our goal is to reach and maintain this mapping. For each old object, we must ensure that there is exactly one new object with corresponding attributes. For example, an unprocessed subscription must have exactly one corresponding selection, and attributes on both objects, such as employee ID, benefit type, and policy ID, must match.

The double write process consists of two steps to maintain the integrity of the mapping:

Step 1) Each time we perform a CRUD operation on our old table, we must do the same write to the new table.

Screen-Shot-2018-08-01-at-2.51.31-PM

In the above example, we add the code in green to create the new data - for each unprocessed subscription, there must be a selection.

We wrap all of the double write code in a transaction that ensures atomicity of the writes and rolls back if one write fails.

Step 2) Existing unprocessed subscriptions and waivers also need matching selection objects. We run a migration that creates selections for each of the already existing unprocessed subscriptions and waivers.

After these two steps, we expect our mapping function to return a selection for any unprocessed subscription or waiver.

To have an accurate double write, we need to catch all the places where we write unprocessed subscriptions or waivers. We create a Sidekiq job that runs a “differ” between the objects in our new selections table and all unprocessed subscriptions and waivers. The differ checks that each unprocessed subscription and waiver in the old tables has a matching selection. We report differences to a Slack channel in a readable format, allowing us to track and fix the errors easily.

Screen-Shot-2018-08-01-at-2.51.49-PM

Above is an example of the output of our differ. Here we see that the subscription has attribute policy_id set to 21, while the matching selection has policy_id set to 22. Our job is to investigate where we are updating the subscription’s policy_id but not the selection’s!

We run our differ until we are confident that we are successfully double writing selections from unprocessed subscriptions and waivers. The differ was set to run asynchronously for recent records every hour and all records every day, but it also could have been run in realtime on every object save.

We had many learnings about the double-write part of this project. Our differ only caught errors generated by code paths that were actually run. This is a problem when some code paths are not frequently hit.

If we were to do this project again, we would first move all CRUD operations against the old tables into a service class. Then our old tables could warn if they weren’t being written to through this service class (perhaps by setting an instance variable on the object within the processor). This would allow us to verify the double-writes within live code, and would give us a pinpointed location where a double-write is not happening correctly.

The Single Read

After we implement double writing to the selections table, we can read from either the old tables or the new table. The single read phase ensures that we are no longer reading from the old tables.

We implement single reads by making two types of changes:

  1. Deriving primitive values (i.e. boolean, string, integer, etc) using the selections table instead of the subscriptions or waivers table. Here is an example on one of our core models, Enrollment:
class Enrollment
  def unprocessed?
    subscriptions.unprocessed.any?
  end
end

We are using unprocessed subscriptions to derive a primitive boolean value. Replacing this to read from the selections table is straightforward:

class Enrollment
  def unprocessed?
    selections.unprocessed.any?
  end
end

Now we are single reading from our new selections table!

  1. Modifying places where a class expects an interface of the old model. Now that the data is the same on both the old and new tables, we can pull the data off of the new table. For example, we pass in selections instead of subscriptions to an existing class that previously expected subscriptions.

There is no rock-solid guarantee that we have completely implemented single reads. However, we can override attributes on our subscriptions and waivers tables to catch places where we are still using these old tables. We do this using metaprogramming to override methods and Bugsnag to report the errors, but it can also be done by explicitly defining override methods and using a logging or reporting service that is supported by your infrastructure. Metaprogramming allows us to write our code in a more DRY way, but at the cost of some clarity.

class Subscription < ApplicationRecord

  COLUMN_NAMES = [
    { name: 'id' },
    { name: 'policy_id' },
    { name: 'processed' },
  ].freeze

  COLUMN_NAMES.each do |column|
    column_name = column[:name]
    define_method(column_name) do |*_args|
      unless read_attribute(:processed)
        NotifySelectionsSingleReadError.verify(caller) 
      end
      read_attribute(column_name)
    end
  end

  class NotifySelectionsSingleReadError
    def self.verify(backtrace)
  if !whitelisted(backtrace)
    raise “Should not be using the old tables!” if Rails.env.test?
    Bugsnag.notify(“Should not be using the old tables!”) 
    	end
    end
  end
end

The NotifySelectionsSingleReadError class includes a whitelist for places where we should still be using our old model, such as in our double writes. Raising only in test ensures developers do not write new code that uses the old tables.

With this verification check in place, only three (largely automatic) steps remain to fully implement single reads:

Step 1: We proceed to fix our test suite, which now fails for a non-whitelisted read of our old table.

Step 2: In our local development environments, we modify the verification to fail if reading the wrong model and run through several key scenarios in our application until we can complete each scenario with no errors. For example, we go through an employee enrollment flow, process a subscription, and run our maintenance sidekiq jobs, checking that we never read from our old tables.

Step 3: Over the course of several days, we keep an eye on our bug reporting service (Bugsnag) for incorrect single reads in production and fix them as they come in.

Much like the double-write part of the project, we had several learnings about this phase. Firstly, we could have similarly created a special service class that was responsible for reading subscriptions and selections. This would have allowed us to first focus on refactoring reads into this class, and then having the class determine which object type should be read. This approach, however, would still not have helped if the new table had a different interface. Therefore, an intermediate change that would have taken longer, but would have made the single read step very easy, would have been to refactor our reads to use value objects, so later reading from selections would be as simple as creating the same value object differently in a single place.

There were many parts of our code base that took a subscription and converted it to a value object. This made the single read pretty straightforward, because we just needed to construct the same value object with selections instead.

Another thing we learned when single reading is to think beyond the ORM (in this case, ActiveRecord). Database-level reads, such as foreign-key constraints, need to be removed or changed. For microservices or cross-platform architectures, you’ll have to think about other objects that have foreign keys that point to your old tables. We ran into several issues where users had bookmarked CRUD-like URLs that referenced unprocessed subscriptions, which no longer existed after implementing single reads!

The Single Write

Once single reads are in place, we no longer need our original tables. In fact, we could fill our old tables with junk data or remove the tables entirely and nothing should break. We still want to audit our double write phase, however, to ensure that none of the writes to our new table are dependent on the old tables. For example, we find that we sometimes write to one of our old tables and then do a 1:1 copy to our new table, rather than use application logic to do the same write twice.

After a manual audit to ensure complete independence between writes to the old tables and the new table, we take the following steps:

  1. We stop creating new rows in the old tables by removing any create calls for unprocessed subscriptions or waivers.
  2. We destroy all waivers and unprocessed subscriptions in the old tables after we stop creating new rows.
  3. We start removing remaining CRUD operations (which by now are dead code) against the old tables.

Because of our work during previous steps, removing the CRUD operations can be done rather indiscriminately, by simply removing any code that writes to the old tables.

Here is an example of a CRUD operation where we destroy a subscription along with a selection:

class SelectionsController
  def destroy
    ActiveRecord::Base.transaction do
 selection = employee.selections.find(params[:id])
 selection.destroy!
 subscription.destroy!
    end
  end
end

After steps 1 and 2, we no longer need to destroy subscriptions, and the code in red can be removed!

Lastly, to verify that all steps are complete, we remove the whitelist generated in the single read phase and apply the strategies from the single read phase (i.e. fixing the test suite, monitoring bugsnag, etc.). We also add a regular audit that verifies that our waivers table has 0 rows and our subscriptions table only contains processed subscriptions!

Learnings from this phase are similar to those in the double-write phase. With CRUD service classes, single-writing would be much simpler; we would remove the subscription version of the change in the service class and then rename the service class if we wanted to.

We’ve finished the three phases of our zero downtime migration and ended up with a cleaner, leaner model that has a clearly defined role. The new model is more easily understood, simple to work with, and allows us to write new features. Refactors like this demonstrate the importance of giving your initial data design a lot of thought. Codebases are constantly evolving, however, and refactors are sometimes unavoidable. Having a solid execution plan is important for when it comes time to refactor.

Thanks for reading!

Stephanie Tsai and Alex Evanczuk

Stephanie Tsai has been a benefits engineer at Gusto for almost 2 years. She graduated from Stanford, plays cello, badminton, and board games, and drinks a potentially unhealthy amount of boba.

Alex Evanczuk has been on the benefits team for over a year and a half. He graduated from University of Pennsylvania and has no other hobbies.