Gusto is a zippier experience in 2023 for everyone.
For businesses with 25 or more employees, pages load 1.4 times faster. After loading gusto.com for the first time, navigations in Gusto are 2 times faster. For smaller businesses, page loads are 1.2 times faster, and all navigations after the first are now 1.8 times faster. Gusto’s internal customer support tools are faster, too: 30% faster than a year ago.
This effort to improve performance was successful because we:
- Captured full-stack metrics using native browser APIs
- Brought in external help, in the form of an experienced web performance expert
- Caused a cultural shift in engineering: “engineers like us do performance work like this”
- Had an ambitious and clear goal driven by a strong business need
Focus on Full Stack Metrics
At many software companies, engineers don’t know that the site is slow for real users in production. In development, on one's local machine, things can often feel quite fast.
If Gusto’s engineers could feel our customers’ pain, they would feel an innate need to tidy up and make the site faster. We could harness this internal motivation to “tidy up”.
This idea of tidying is something we learned from former Gustie Kent Beck. Engineers have a desire to work on a good, well-crafted application. Show them the problem, and they will fix it.
Automated reminders, monthly packs, simple dashboards
How can we capture a customer's painful, slow experience through metrics?
High-level metrics reported from client browsers formed the basis of our effort. This sounds simple, but it often isn't. It's easy to over-focus on metrics that are simple to measure but unimportant. Back-end metrics, for example, are often quite simple to capture. But in our metrics, back-end response time often makes up only 20% of the median page load.
We presented these numbers to engineers in a few ways that were easy to understand, impossible to ignore, and would drive the impulse to tidy up:
- Automated reminders in Slack. Every 2 weeks, each team received a summary of the performance of pages that they own.
- Monthly metrics packs. Every month, each product group received a pack of metrics on code quality. One of those metrics was page load performance, which we graded on a simple “red/yellow/green” scale.
- Easy to understand dashboards. These provided the context and “next action steps” for engineers to fix “red” pages.
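As a rough illustration of the “red/yellow/green” scale, here is a minimal sketch of how a page's median load time could be graded. The threshold values, the `Thresholds` shape, and the example routes are all assumptions made for this sketch; the actual cut-offs used in the metrics packs are not stated here.

```typescript
type Grade = "green" | "yellow" | "red";

// Hypothetical cut-offs: the real thresholds are not given in this post.
interface Thresholds {
  greenMaxMs: number;  // at or below this, the page is "green"
  yellowMaxMs: number; // at or below this (but above green), the page is "yellow"
}

// Grade a page by its median (p50) load time in milliseconds.
function gradePageLoad(medianLoadMs: number, t: Thresholds): Grade {
  if (medianLoadMs <= t.greenMaxMs) return "green";
  if (medianLoadMs <= t.yellowMaxMs) return "yellow";
  return "red";
}

// Example values only, chosen for illustration.
const exampleThresholds: Thresholds = { greenMaxMs: 2000, yellowMaxMs: 4000 };

// Hypothetical pages owned by one team.
const teamPages = [
  { route: "/example-a", medianLoadMs: 1500 },
  { route: "/example-b", medianLoadMs: 5200 },
];

// The "red" pages are the ones a dashboard would surface for follow-up.
const redPages = teamPages.filter(
  (p) => gradePageLoad(p.medianLoadMs, exampleThresholds) === "red",
);
console.log(redPages.map((p) => p.route));
```

Keeping the thresholds as data rather than hard-coding them makes it easy to tighten the scale as pages get faster.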
Tracking page loads in Datadog Logs using native browser APIs
Our two critical metrics were the initial load time (a cold boot of the app) and route change time (URL change). These events are application-specific, and we needed custom code to instrument them.
Our frontend team created a framework for our React application to capture these metrics. It uses the User Timing API, and reports metrics back to our Rails app, where they turn into Datadog Logs.
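To give a feel for this kind of instrumentation, here is a minimal sketch built on the User Timing API (`performance.mark` and `performance.measure`). It is not Gusto's actual framework: the metric names, the `startTiming` helper, the `Reporter` type, and the example route are assumptions made for illustration. The reporter is injected so that a real app could send the result to its backend, while the sketch itself runs anywhere a `performance` global exists.

```typescript
// Hypothetical metric names for the two events described above.
type MetricName = "initial_load" | "route_change";

interface TimingReport {
  metric: MetricName;
  route: string;
  durationMs: number;
}

// In a browser, a reporter might POST the report to a backend endpoint
// (for example with navigator.sendBeacon). Here it is just a callback.
type Reporter = (report: TimingReport) => void;

let nextTimingId = 0;

// Marks the start of an event and returns a function that marks the end,
// measures the span between the two marks, and reports the duration.
function startTiming(metric: MetricName, route: string, report: Reporter): () => TimingReport {
  const id = nextTimingId++;
  const startMark = `${metric}:${id}:start`;
  const endMark = `${metric}:${id}:end`;
  const measureName = `${metric}:${id}`;
  performance.mark(startMark);

  return () => {
    performance.mark(endMark);
    const measure = performance.measure(measureName, startMark, endMark);
    const result: TimingReport = { metric, route, durationMs: measure.duration };
    // Clean up so long-lived single-page apps don't accumulate entries.
    performance.clearMarks(startMark);
    performance.clearMarks(endMark);
    performance.clearMeasures(measureName);
    report(result);
    return result;
  };
}

// Usage: start timing when the URL changes, finish once the new route has rendered.
const sentReports: TimingReport[] = [];
const finishRouteChange = startTiming("route_change", "/example-route", (r) => {
  sentReports.push(r);
});
// ... the new route renders here ...
finishRouteChange();
```

Returning a "finish" function keeps the start and end marks paired, which matters when several route changes overlap.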
We also used the Resource Timing API. This records the requests each React route change made to our backend. In our app, that's usually lots of GraphQL and JSON requests.
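As a sketch of how Resource Timing data could be reduced to "the requests this route change made", the function below filters timing entries down to fetch/XHR calls that started after the route change began. The `ResourceEntry` shape mirrors a subset of the browser's `PerformanceResourceTiming` fields; the function name, the summary shape, and the example URLs are assumptions made for illustration, not the code Gusto runs.

```typescript
// Subset of the fields found on the browser's PerformanceResourceTiming entries.
interface ResourceEntry {
  name: string;          // the request URL
  initiatorType: string; // e.g. "fetch", "xmlhttprequest", "img", "script"
  startTime: number;     // ms since the page's time origin
  responseEnd: number;   // ms since the page's time origin
}

interface RequestSummary {
  url: string;
  startMs: number;
  durationMs: number;
}

// Keep only API-style requests (fetch/XHR) that started at or after the
// route change, ordered by start time.
function backendRequestsSince(entries: ResourceEntry[], routeChangeStartMs: number): RequestSummary[] {
  return entries
    .filter((e) => e.initiatorType === "fetch" || e.initiatorType === "xmlhttprequest")
    .filter((e) => e.startTime >= routeChangeStartMs)
    .map((e) => ({ url: e.name, startMs: e.startTime, durationMs: e.responseEnd - e.startTime }))
    .sort((a, b) => a.startMs - b.startMs);
}

// In a browser, the entries would come from the Resource Timing buffer:
//   const entries = performance.getEntriesByType("resource") as PerformanceResourceTiming[];
// Here we use hand-written sample entries with made-up URLs.
const sampleEntries: ResourceEntry[] = [
  { name: "https://example.test/app.js", initiatorType: "script", startTime: 10, responseEnd: 90 },
  { name: "https://example.test/graphql", initiatorType: "fetch", startTime: 1200, responseEnd: 1500 },
  { name: "https://example.test/api/items.json", initiatorType: "xmlhttprequest", startTime: 1100, responseEnd: 1150 },
  { name: "https://example.test/api/old.json", initiatorType: "fetch", startTime: 400, responseEnd: 450 },
];

const routeChangeRequests = backendRequestsSince(sampleEntries, 1000);
console.log(routeChangeRequests);
```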
Bring in Some Help
At the start of this effort, Gusto contracted with The Speedshop. The Speedshop is a Ruby on Rails performance consultancy run by me, Nate Berkopec.
I held ultimate responsibility for the success or failure of this project. Yet, you can't change an application (or organization) of Gusto's size alone. I needed to create leverage. To do that, I made systems and processes.
Create a predictable, repeatable process
When I arrived at Gusto, no dedicated, company-wide performance effort existed. Any work that was being done was done by a few engineers in their spare time.
Performance work was ad-hoc and perceived as risky. Managers couldn’t be sure that performance efforts had any effect or if they would be successful. When doing perf work, engineers followed red herrings and disappeared down rabbit holes.
As a result, performance work often languished in the backlog. Performance work wasn't prioritized, because it was ineffective. Slow page loads metastasized.
As a consultant, I've seen it all, so I have an "off-the-shelf" toolkit for fixing perf. Specialized for Ruby on Rails, it's a repeatable and predictable process. My job was to deploy this process and train team members on how it works.
This helps make managers comfortable with prioritizing this work. "If you follow these steps, I can promise X, Y and Z."
We needed to provide a process that was:
- Measurable. Our process had to show real improvements, so managers could justify doing the work.
- Teachable. This can't be a "staff engineer problem". There aren’t enough of them. Anyone can write a performance bug, so everyone needs to be able to fix them.
- Repeatable. Managers needed to trust teams to work on performance. The process had to be predictable and deliver improvement each time.
This was that process for performance issues at Gusto:
- Review automated alerts and perform monthly performance reviews. We created alerts, monitors, and dashboards for monthly review on a per-team basis.
- Replicate locally. Replicating issues in development allows us to prototype and iterate on a solution. We need the shortest possible feedback cycle to answer "is this faster?".
- Fix and benchmark. Since we’ve replicated the issue, we can show that we’ve fixed it by re-running the replication and seeing the slowdown disappear. That gives us confidence the fix will have a strong impact in production. Often, we can also write a benchmark to show that we’ve solved the problem, comparing before and after.
- Deploy and verify impact. After deploying the changes, we verify that our fix had the impact we expected. We use data, presented in tools such as Datadog Notebooks, to prove that we made things better. No guessing or hand-waving allowed.
It's the scientific method: hypothesize, experiment, analyze, and repeat. This empirical process formed the basis of all of our performance work.
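The "fix and benchmark" step can be sketched as a before/after comparison. The helpers below (`median`, `sampleRuns`, `speedup`) and the two example functions are hypothetical and written for illustration; a Rails codebase would more likely use a Ruby benchmarking tool, but the idea is the same: time both versions several times and compare medians rather than single runs.

```typescript
// Median of a non-empty list of samples.
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("median of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Run `fn` repeatedly and record how long each run took, in milliseconds.
function sampleRuns(fn: () => void, runs: number): number[] {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  return samples;
}

// How many times faster the "after" version is, comparing medians.
function speedup(beforeMs: number[], afterMs: number[]): number {
  return median(beforeMs) / median(afterMs);
}

// Illustrative "before" and "after" versions of the same computation.
const values = Array.from({ length: 2000 }, (_, i) => i % 500);
const countUniqueSlow = () => values.filter((v, i) => values.indexOf(v) === i).length;
const countUniqueFast = () => new Set(values).size;

const beforeSamples = sampleRuns(countUniqueSlow, 20);
const afterSamples = sampleRuns(countUniqueFast, 20);
console.log(`median before: ${median(beforeSamples).toFixed(3)}ms`);
console.log(`median after:  ${median(afterSamples).toFixed(3)}ms`);
```

Medians are used because a single slow run (garbage collection, a noisy machine) can otherwise dominate the comparison.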
Make the mistake impossible
I’ve worked with Gusto for almost 2 years now. They’ve been a fantastic and interesting client. It’s one of the largest Rails codebases in the world, with 5.8 million lines of code. Many of Gusto’s unique challenges come from this scale.
Gusto’s scale also means that performance fixes cannot come only from “upskilling”. Not everyone will be 100% performance-minded all the time.
At Gusto, I learned that systems are power. We must create or change the system to ensure lasting, long-term technical change. Don't train engineers to avoid a certain mistake; make that mistake impossible to make. Don’t count on engineers to always check for a particular problem; have CI or an automated process do it for them.
Here’s an example. On many pages, our metrics showed that we would make one request after another to our backend. This serial request chain was often 4 or 5 requests long! The 2nd request didn’t start until the 1st one finished, and so on and so forth. That’s slow! Instead, we should do this in parallel.
For developers, we deployed a simple indicator to make this mistake obvious. It shows devs how many times in a row the page has made a request to the backend. The indicator turns red if it happens 2 or more times. This mistake was no longer possible to ignore.
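One way such an indicator could work is sketched below. This is an assumption-laden illustration, not the tool Gusto deployed: it treats a request as "serial" when it starts only after an earlier one has finished, and counts the longest such chain from request timings alone. Timing data cannot prove that one request truly depended on another, so this is a heuristic. The `RequestSpan` shape, the function names, and the "neutral" (non-red) state are hypothetical.

```typescript
interface RequestSpan {
  url: string;
  startMs: number;
  endMs: number;
}

// Length of the longest run of requests in which each one starts at or after
// the previous one ended. Greedy interval scheduling: always take the request
// that finishes earliest among those that start after the current chain's end.
function longestSerialChain(spans: RequestSpan[]): number {
  const byEnd = [...spans].sort((a, b) => a.endMs - b.endMs);
  let chainLength = 0;
  let lastEnd = -Infinity;
  for (const span of byEnd) {
    if (span.startMs >= lastEnd) {
      chainLength += 1;
      lastEnd = span.endMs;
    }
  }
  return chainLength;
}

// Red at 2 or more requests in a row, as described above.
// The "neutral" state is an assumption; the non-red state isn't specified.
function indicatorState(chainLength: number): "neutral" | "red" {
  return chainLength >= 2 ? "red" : "neutral";
}

// Four requests, each waiting for the previous one (made-up URLs and timings).
const serialPage: RequestSpan[] = [
  { url: "/api/a", startMs: 0, endMs: 100 },
  { url: "/api/b", startMs: 100, endMs: 200 },
  { url: "/api/c", startMs: 210, endMs: 300 },
  { url: "/api/d", startMs: 300, endMs: 400 },
];

// The same kind of work issued in parallel.
const parallelPage: RequestSpan[] = [
  { url: "/api/a", startMs: 0, endMs: 100 },
  { url: "/api/b", startMs: 5, endMs: 90 },
  { url: "/api/c", startMs: 10, endMs: 120 },
];

console.log(indicatorState(longestSerialChain(serialPage)));
console.log(indicatorState(longestSerialChain(parallelPage)));
```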
Change the Culture: "People Like Us Do Things Like This"
Seth Godin, the business writer, summarizes culture as: “people like us do things like this.” As Seth says:
“For most of us, from the first day we are able to remember until the last day we breathe, our actions are primarily driven by one question, ‘Do people like me do things like this?’”
Gusto needed a cultural shift: performance work was not something “people like us” did at Gusto.
Guilds, thanks, and underground hackathons
We wanted to make performance work visible and culturally rewarded at Gusto. Here are some ways we did that:
- A performance Slack channel and a performance guild. At Gusto, there are several “guilds”: groups of engineers interested in a particular topic who meet once a month. While there are many guilds, the Performance guild has been one of the most active. The Slack channel for this guild has been a constant stream of celebration, kudos and interesting questions.
- “Kudos” before every engineering all-hands meeting. At each all-hands, a 3-minute prelude celebrated major performance “wins”. Want to get recognized in front of the entire company? Ship a performance win.
- Hackathons. Informally, we scheduled a few performance “hackathons”. Engineers could get together and work on perf for an entire Friday. Gusto has a strong culture of self-direction for engineers, so sneaking in one day every few months wasn't difficult.
Now we've got a new culture for perf work: “Programmers like me do work like this.”
Set An Ambitious Goal, from a Strong Business Need
All technical projects have an easy start. But eventually, the work gets hard and resistance grows. You've converted 80% of code to the new system, and the hardest 20% remains. The low-hanging fruit has all been picked. Teams complain about workload and new priorities.
It’s at this point that many projects disintegrate. Teams move on to new projects, or impatient managers make everyone get back to shipping “real features”.
Instead, a clear business case of “if we do Y, we will increase revenue by $Z” reminds everyone that the work is important.
Gusto CTO Edward Kim realized that faster performance was a strategic opportunity. Businesses with a larger number of employees were experiencing intolerable load times. If you have more employees, we have more taxes to calculate, more withholdings to create, and much more. It's all in proportion to how many people are on payroll.
But bigger customers are more valuable: they pay more. Bigger customers on Gusto should have better experiences, not worse! If O(n) could become O(log(n)), we would improve acquisition and retention.
This clear and present need allowed us to set goals, budgets and workstreams on long-term horizons. Performance work was therefore protected from the weekly whims and turbulent winds of project management.
Setting a SMART Goal
The project’s timeline was set for one year. But what would the final goal be?
We knew that small businesses loved Gusto. We weren’t getting any complaints about how fast Gusto was from small businesses. The experience was good for this group.
If we could replicate the small business experience for customers of any size, we would be set. As a side effect, speeding things up for larger companies would make things even faster for small businesses. Two birds, one stone.
Here’s the goal we came up with: "Reduce load times for all companies with less than 100 employees to have the same load time as companies with less than 25 employees."
This goal was SMART: specific, measurable, attainable, relevant, and time-bound:
- Specific. We set the goal to reduce load times to 1.8 seconds for a particular cohort of Gusto customers.
- Measurable. We built and verified the instrumentation to measure this metric and our progress.
- Attainable. It was already achieved for a different cohort of customers.
- Relevant. Making things faster for bigger customers meant more and bigger customers on the Gusto people platform.
- Time-bound. We picked the arbitrary timeline of 1 year.
We Hope You Enjoy a Faster Gusto
This project was not only the work of The Speedshop, but depended greatly upon the excellent work and initiative of many of the great people at Gusto, including but not limited to: Toni Rib, Mangesh Tamhankar, Ngan Pham, Stephan Hagemann, Catherine Zarra, and Kelly Sutton.