Eliminating Flaky Ruby Tests

Moving Fast and Not Breaking All The Things

Every day, Gusto engineers work on systems that move billions of dollars, secure sensitive customer information, and pay hundreds of thousands of employees. Our customers' livelihoods depend on our product, and we need to be able to respond to issues quickly, maintain fast iteration velocity, and ensure the stability of our applications in order to support them. None of this would be possible without a rock-solid continuous integration (CI) system that gives us confidence that our changes are safe to rollout.

However, even after investing heavily in building a scalable CI system to support our infrastructure, we noticed flaky tests cropping up in our builds more and more often as our application grew. This had destructive effects on developer productivity: builds were needlessly retried, trust in our test suite was eroded, and frustration with our CI system grew among the team. Non-deterministic tests led to slower iteration velocity because we couldn’t rely on our build results, and fixes were prevented from being shipped in a timely manner. For a high-growth tech startup like Gusto, this could have proven financially disastrous.

In response, we set out on a team-wide effort to debug and stabilize our test suite. Root causes for flaky tests proved difficult to track down, but we found they were often traceable to a handful of common pitfalls. We’ve assembled a list of the most common failure modes below. This best-practices checklist is used by Gusto developers to maintain a clean, deterministic RSpec test suite. Our test suite, using these best-practices, currently validates 300,000 lines of application code by running 40,000 RSpec examples over 250 times a day, with minimal levels of flakiness.

Dangerous Capybara Wait Matchers

For many Rails apps, flakiness presents itself in Capybara request specs. There’s a lot of moving pieces to these tests — a browser is running your application while a computer is triggering UI events faster than any ordinary human could. It can be difficult to ensure assertions are timed correctly when lots of client-side interactions and AJAX requests are occurring.

When you use a Capybara method that doesn’t wait for an element to appear, you may inadvertently encounter a race condition. For example, if you expect to:

In your client, receive a 422 response from the server
Render an error message based on that response

You will want to ensure your test waits for the error message to be rendered.

Unfortunately, Capybara has a variety of methods that do not wait for an element to show on the page, and the docs aren’t as explicit as we would like. Here is a list of methods we avoid when writing Capybara request specs:

Avoid dangerous methods that do not wait

visit(path) / current_path
all(selector) / first(selector)
Accessors: text, value, title, etc

Use safe methods that do wait

find(selector), find_field, find_link
has_selector?, has_no_selector?
Actions: click_link, click_button, fill_in

If you use safe methods, Capybara will wait until the configured timeout (Capybara.default_max_wait_time) for the element to appear. Using these safe methods that do wait when asserting against the DOM will ensure tests are resilient to asynchronous actions in JavaScript code.

Database Pollution

Database pollution occurs when we do not have an self-contained database to work off of when running a new test example. Common failure modes include unexpected numbers in COUNT queries and failed uniqueness constraints while trying to create new records. While the failures seems cryptic, we can apply a simple solution for the entire suite.

database_cleaner — If a test creates records or modifies the database, it’s important the changes are removed before future test runs. We can hook into the RSpec lifecycle to clean the database between each test example. The database_cleaner gem with some simple RSpec configuration options will do this for us:

require ‘database_cleaner’
RSpec.configure do |config|
  config.before(:suite) { DatabaseCleaner.clean_with(:truncation) }
  config.before(:each) do |example|
    if example.metadata[:use_truncation]
      DatabaseCleaner.strategy = :truncation
    else
      DatabaseCleaner.strategy = :transaction
    end
    
    DatabaseCleaner.start
  end
  
  config.append_after(:each) { DatabaseCleaner.clean }
end

By default, we opt for using the transaction strategy because it is faster; a transaction is opened at the beginning of the example, and simply rolled back when it completes. If we use truncation, we need to issue an additional set of queries to truncate the tables. We highly recommend using the transaction strategy wherever possible as the DB time saved really adds up.

Occasionally, truncation is required, most notably when running request specs with Capybara. The thread running your Rails server uses a separate database connection from the one running your test, so neither thread can see database records from the other before they are committed. You’ll notice the usage of append_after(:each) for invoking the DatabaseCleaner.clean method, ensuring that the DatabaseCleaner runs only after all the shutdown steps for Capybara have been run. This prevents any race conditions between the app thread and the server thread from leaving orphaned records in the database. In Rails 5.1+ apps, the database connection is shared across threads and the transaction strategy can be used to avoid this altogether.

TEST_ENV_NUMBER — Giving each test process a unique database will make the test suite more deterministic. Rails will create a single dedicated test database, but when running tests in parallel (for example, using parallel_tests), every process can use the TEST_ENV_NUMBER environment variable to connect to a unique database.

Order Dependent Tests

Sometimes, although a test file may succeed as a whole, specific examples may fail when individually run. This means that each test is not running in an isolated environment, leading to cross-test pollution, and flaky failures. RSpec tries to combat this by providing randomized test ordering, allowing you to specify a reproducible, random run order via the --seed flag. Although by default RSpec runs tests in their defined order, setting the --seed flag helps root out these failures early in the development process.

For example:

describe ‘order dependent specs’ do
  let(:response) { $redis.get(‘my_third_party_response’) }
  let(:parsed_response) { JSON.parse(response) }
  let(:client) { Client.new }
  
  context ‘with params’ do
    let(:params) { { limit: 10 } }
    before do
      $redis.set(
        ‘my_third_party_response’,
        client.get_response(params),
      )
    end
    
    it ‘returns the params’ do
      expect(parsed_response).to have_length(10)
    end
  end
  
  it ‘has the correct IDs’ do
    ids = parsed_response.map { |j| j[‘id’] }
    expect(ids).to match_array [0..9]
  end
end

The second example in this file only passes because the first example populated a shared resource. If the second example were to be run on its own, it would fail since there’d be no value populated in Redis. It’s an easy mistake to forget to hoist the before block up to apply the test setup applies to both examples. Do yourself a favor and let RSpec help you by randomizing your tests.

Non-Deterministic Attributes

When testing how a particular attribute behaves, always use deterministic attribute values. Testing inclusion validations is one common example of this. In general, avoid using variables that change from run to run like:

context ‘with a valid attribute’ do
  let(:model) { build(:animal, type: type)
  let(:type) { [ ‘cat’, ‘dog’, ‘dragon’ ].sample }
  
  subject { model.valid? }
  it { is_expected.to be_valid }
end

This code expresses that cat, dog, and dragon are all legitimate values for the animal type. However, a different value will be used on each test run, meaning this test is not resilient to changes in the system. If dragon were removed as an allowed Animal#type, this test likely would not fail in a pull request for that change. Opt for explicitness in these cases and write a separate test case for each type.

Faker is a commonly used library for creating realistic test values. Its random generators create helpful, and entertaining, data, but they occasionally introduce unwanted randomness.

context ‘with a changed email’ do
  let(:model) { build(:my_model, email: Faker::Internet.email) }
  let(:new_email) { Faker::Internet.email }
  subject { model.save! }
  it ‘sends an email’ do
    model.email = new_email
    expect { subject }.to change(ActionMailer::Base.deliveries, :count).by(1)
  end
end

This will work most of the time, but inevitably a test run will cause Faker to return the same email, making the test fail. In examples like this where we need explicitly different values, leave Faker at home and hardcode them.

This also applies to using factories with factory_bot where values need to be unique. When populating columns enforced by a unique index, be sure to take advantage of sequences to guarantee there won’t be unintended uniqueness constraint violations.

FactoryBot.define do
  # Bad
  factory(:user) do
    email { Faker::Internet.email }
  end
  # Good
  factory(:user) do
    sequence(:email) { |n| “#{n}_#{Faker::Internet.email}” }
  end
end

ID Dependence

Many applications use auto-incrementing IDs in the database, but you can’t always guarantee that the record you create will have the same ID in each test run. When tests run in a transaction, every record created will bump the auto-increment value on your table. Because you cannot guarantee how many records are created before the test, there is no reliable, static value for new records’ ID. Additionally, when tests run with truncation, the auto-increment will be reset for that table.

Instead, always use the model.id accessor when writing assertions.

context ‘when saving a file with an artifact’ do
  let(:document) { create(:document, artifact: ‘test’) }
  subject { document.save_file! }
  # Bad
  it ‘writes to the right path’ do
    expect(File).to receive(:open).with(‘/tmp/documents/1/artifact’)
  end
  # Good
  it ‘writes to the right path’ do
    expect(File).to receive(:open).with(“/tmp/documents/#{document.id}/artifact”)
  end
end

Globally Cached Values

Memoizing values with instance variables is a very common pattern in Ruby. However, when this is done at the class level, tests can pollute one another by persisting stubbed values across examples.

# state_checker.rb
class StateChecker
  UNSUPPORTED_STATES = [‘CA’, TX’]
  
  def self.supported_states
    @supported_states ||= STATES — UNSUPPORTED_STATES
  end
end

# state_checker_spec.rb
describe StateChecker
  describe ‘.supported_states’ do
    before do
      stub_const(‘STATES’, [‘AA’, ‘BB’, ‘CC’, ‘DD’])
      stub_const(‘StateChecker::UNSUPPORTED_STATES’, [‘AA’, ‘BB’])
    end
    it ‘removes the unsupported states’ do
      expect(StateChecker.supported_states).to match_array %w(CC DD)
    end
  end
end

At a glance, this feels like a reasonable way to test the logic of the MyService.supported_states method. However, any subsequent tests that use this method will read the stubbed value instead of the actual supported states value, because the class instance variables are not cleared. We should therefore avoid stubbing when testing memoized methods. However, if it must be done, the memoized method can be reset with MyService.instance_variable_set(:@supported_states, nil).

Time-Based Failures

As a payroll company, we deal often with code that handles business days, quarter-end filings, and other time-based computations. For example, when running a payroll, we compute the day you should receive your paycheck (known as the “check date”). This value is computed based on business days, and can be different based on which day of the week payroll is run. We’ve noticed that sometimes these tests fail because they rely on an implicit assumption that they will always be run on Mondays. In order to combat this, we make sure to use Timecop, stressing each particular date range explicitly.

describe Payroll do
  describe ‘#check_date’ do
    before { Timecop.travel(today) }
    after { Timecop.return }
    context ‘on a Friday’ do
      let(:today) { Date.new(2018, 4, 13) }
      it { is_expected.to be_monday }
    end
    context ‘on a Monday’ do
      let(:today) { Date.new(2018, 4, 16) }
      it { is_expected.to be_wednesday }
    end
  end
end

As best as possible, we try to Timecop all examples that return different results depending on what day they are run. We’ve even wrapped the before and after blocks you see in the example above into a method called time_travel_to that will handle Timecop cleanup after each example.

Reproduction

The next time you come across a flaky test in your test suite, be mindful of the above list, and see if you can reproduce it locally with the following steps:

Run RSpec using the same seed that was used in your CI run
Run the entire list of files that were run before your failing test
Use the --bisect flag to let RSpec do the work of finding the polluting example
Reference this list of common polluters to see if any fit the bill

Conclusion

As Gusto’s business and codebase grew, we noticed flaky test suites slowing down iteration cycles and compromising developer productivity. By investing in a stable CI system and flaky test elimination, we’ve been able to iterate more quickly on changes, ship features to our customers more rapidly, and improve developer happiness. While it requires diligence to stay in front of these and other issues, the above guidelines have helped us root out non-deterministic RSpec tests in our builds and keep our test suite running smoothly.

Engineering ( 50 )

Ruby On Rails ( 16 )

Advice ( 14 )

Career growth ( 14 )

Collaboration ( 13 )

Software Development ( 12 )

Diversity Inclusion ( 11 )

Programming ( 10 )

Modular Monolith ( 9 )

Security ( 9 )

Best Practice ( 9 )

Modularization ( 9 )

Gusto ( 8 )

Spaghetti Code ( 7 )

Ruby ( 6 )

Interviews ( 5 )

Sidekiq ( 5 )

Gradual Modularization ( 5 )

Teamwork ( 4 )

Product Management ( 4 )

Engineering Management ( 4 )

Tech Lead ( 4 )

Refactoring ( 4 )

Monolith ( 4 )

Coding ( 4 )

growth ( 4 )

Technical Strategy ( 4 )

Spaghetti Model ( 4 )

Teams ( 3 )

Startups ( 3 )

Database ( 3 )

Authorization ( 3 )

Guide ( 3 )

Data Engineering ( 3 )

Tidying ( 3 )

Api ( 3 )

Experiment ( 3 )

Decision ( 3 )

Gusto Values ( 3 )

Productivity ( 3 )

Performance ( 3 )

Payroll ( 3 )

Architecture ( 3 )

React ( 2 )

Javascript ( 2 )

PM ( 2 )

Job Hunt ( 2 )

Migration ( 2 )

Redis ( 2 )

Incident Response ( 2 )

Women ( 2 )

Rails Engines ( 2 )

Startup Lessons ( 2 )

Big Data ( 2 )

Getting Started ( 2 )

Data Lake ( 2 )

Pull Request ( 2 )

Code Review ( 2 )

Integrations ( 2 )

Outcomes ( 2 )

MVP ( 2 )

Client Platform Engineering ( 2 )

CI/CD ( 2 )

PII ( 2 )

Testing ( 2 )

Packwerk ( 2 )

Gem ( 2 )

Platform ( 2 )

Mentorship ( 2 )

Internship ( 2 )

Backbone ( 1 )

Webpack ( 1 )

Immutable ( 1 )

Cto ( 1 )

Double Write ( 1 )

Single Read ( 1 )

Models ( 1 )

Double Read ( 1 )

Single Write ( 1 )

Zero Downtime ( 1 )