How we introduced granular authorization into our application and API.

Illustration by Camellia Neri

Last year, my team extended Gusto’s authorization system to give admins granular access to their companies’ accounts. In software security terms, authorization determines what a user can do in a system, while authentication establishes who a user is.

For context, Gusto is an application that helps small and medium-sized businesses meet their HR, payroll, and benefits needs. An admin is a typical user in our system who performs many actions in their account, such as running payroll for employees.

As Gusto has scaled as a company, so have our customers, and our one-size-fits-all approach to authorization no longer supported our customers’ needs. This is a journey that I expect many software engineering teams embark on. My team worked on this project for about ten months, and emerged with ideas and learnings that we thought were worth sharing.

The learnings I’ll share in this post are: take time to clean up the codebase, design for flexibility, choose the right rollout plan, and make it harder to introduce regressions.

Take time to clean up

The project spanned several quarters because we were tasked with layering granular authorization into every affected part of our existing codebase. Sometimes this was straightforward, and several endpoints could be granted as all-or-nothing. However, as with many legacy monoliths, we had accumulated some “god” models over time - models that know too much or do too much. Permissioning these often required picking apart each field in our API response and determining whether we should expose it for a given permission. Much of our work was auditing how these fields were used.

An example is the “gender” field on an Employee. We originally had little clue how this was used. By auditing the codebase, we found that it is used for several benefits enrollment flows, and therefore admins with the Manage Benefits permission would need access to read and modify it.

We were essentially going over our API with a fine-tooth comb. In our scrutiny, we found parts of our API that were no longer in use, and instances where the API could be simplified. Veteran Gusto engineers on the team helped identify controllers that were written long ago so we could update them to adhere to our current conventions. It is rare that an engineering team can dedicate the resources to do this sort of tech debt work, so we took the opportunity to make the codebase more consistent and delete a large amount of code as a result.

Design for flexibility

To decide what types of permissions to provide in our app, the product and design teams conducted interviews with customers to learn more about their needs. We homed in on eight permissions that corresponded to core job functions, such as Manage Benefits and Pay People. We originally had more, but we didn’t want to prematurely slice our app in a way that wouldn’t map well to our customers’ real-world use cases. Since we were in a learning phase for the overall product, we expected the set of permissions to change.

Beyond the natural uncertainty that comes with launching a new feature, we expected that the following types of changes to permissions would be quite common:

  1. A new permission is added or an existing permission expands. For example, we were building a new Time Tracking feature, so we already knew this type of change was on the horizon.
  2. An existing permission is split into two. We had decided to err on the side of simplicity for our first set of permissions, so there was a large risk that some of the permissions would need to become more granular.
  3. A permission is deleted, or part of its functionality goes away, since features come and go.

In addition, the eight permissions we envisioned overlapped. Two of them in particular, Pay People and View Financial Reports, both needed read access to payroll data, bank reports, and more.

Thinking about how this would look in our codebase, a simple approach for fetching payroll data could look like the following:

if user.is_admin? && (user.has_permission?(PAY_PEOPLE) || user.has_permission?(VIEW_FINANCIAL_REPORTS))
  # respond with payroll data
else
  # raise 401 Unauthorized
end

However, evaluating the simple approach against the types of changes that we knew would be made to permissions, we realized that it would be quite painful to make those changes.

  1. When a new feature is added, we just need to permission the new code. This one is not too painful since the scope of the new code being added is contained.
  2. When an existing permission is split in two, we would need to update each place that the existing permission is referenced, and decide whether it is needed by both the new permissions, or just one of them. This work is not well contained, and each of these decisions being made may require re-auditing the code to remember how it’s used.
  3. We have a similar issue when we delete a permission, or when part of its functionality is removed. For each place we reference it, we need to update the application logic, although this is easier than #2 where we would need to re-audit the code. Not to mention updating the tests…

With these considerations, we decided to introduce an intermediate layer of granular permissions: each user-facing permission maps to a set of granular permissions that our codebase checks directly. For flexibility, we used an in-memory map that looks something like this:

PERMISSION_MAP = {
  ToggleablePermission::PAY_PEOPLE => [READ_PAYROLLS, MANAGE_PAYROLLS, READ_TIME_OFF],
  ToggleablePermission::VIEW_FINANCIAL_REPORTS => [READ_PAYROLLS, READ_ACCOUNTING_INTEGRATIONS],
}

For each admin, we can use the map to look up which granular permissions they have, and then the API code can look more like this:

if user.is_admin? && user.has_permission?(READ_PAYROLLS)
  # respond with data
else
  # raise 401 Unauthorized
end
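As a minimal sketch of how that lookup might work, the snippet below resolves an admin’s user-facing permissions into granular ones through the map. The `Admin` class, the symbol values, and the `granular_permissions` method are assumptions made for illustration, not our production code.

```ruby
require "set"

# Hypothetical granular permissions (names mirror the examples in this post).
READ_PAYROLLS = :read_payrolls
MANAGE_PAYROLLS = :manage_payrolls
READ_TIME_OFF = :read_time_off
READ_ACCOUNTING_INTEGRATIONS = :read_accounting_integrations

# User-facing (toggleable) permissions mapped to the granular permissions they grant.
PERMISSION_MAP = {
  pay_people: [READ_PAYROLLS, MANAGE_PAYROLLS, READ_TIME_OFF],
  view_financial_reports: [READ_PAYROLLS, READ_ACCOUNTING_INTEGRATIONS],
}.freeze

# Illustrative stand-in for an admin; not the real user model.
class Admin
  def initialize(toggleable_permissions)
    @toggleable_permissions = toggleable_permissions
  end

  # Union of every granular permission granted by the admin's toggleable permissions.
  def granular_permissions
    @toggleable_permissions.flat_map { |p| PERMISSION_MAP.fetch(p, []) }.to_set
  end

  def has_permission?(granular_permission)
    granular_permissions.include?(granular_permission)
  end
end

reports_admin = Admin.new([:view_financial_reports])
reports_admin.has_permission?(READ_PAYROLLS)   # true
reports_admin.has_permission?(MANAGE_PAYROLLS) # false
```

Because overlapping permissions resolve to the same granular permission, the endpoint only ever asks one question, regardless of how many user-facing permissions grant it.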

To separate permissions from the API controllers altogether, we used the authorization gem CanCanCan. The library lets you specify which resources are accessible within a particular context using its rule-based authorization definitions. With our own lightweight DSL wrapper, our authorization specification looks like this (in a separate file where admins’ abilities are defined):

subject(Payroll, company_id: params(:company_id)) do
  can [:read], with: Permissions::READ_PAYROLLS
  can [:create, :update, :destroy], with: Permissions::MANAGE_PAYROLLS
end

With this, our API code doesn’t need to know about the new permissioning system:

authorize :read, resource # raises 401 if you don’t have access to the resource
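To make the idea concrete without depending on the gem, here is a small plain-Ruby sketch of a rule table that behaves like the specification above. It is neither CanCanCan’s API nor our actual wrapper: `AbilitySketch`, `AccessDenied`, and the `Payroll` struct are hypothetical names, and the symbols stand in for the `Permissions::` constants.

```ruby
# Raised when no rule grants the requested action (hypothetical error class).
class AccessDenied < StandardError; end

# Stand-in resource with the one attribute the rules are scoped by.
Payroll = Struct.new(:company_id)

# Collects (action, resource class, conditions) rules, but only for the
# granular permissions the admin actually holds.
class AbilitySketch
  def initialize(granular_permissions)
    @granular_permissions = granular_permissions
    @rules = []
  end

  # Declare rules for one resource class, scoped by attribute conditions.
  def subject(resource_class, conditions = {}, &block)
    @current = [resource_class, conditions]
    instance_eval(&block)
  ensure
    @current = nil
  end

  # Register the actions only if the admin holds the required permission.
  def can(actions, with:)
    return unless @granular_permissions.include?(with)

    resource_class, conditions = @current
    actions.each { |action| @rules << [action, resource_class, conditions] }
  end

  def can?(action, resource)
    @rules.any? do |rule_action, resource_class, conditions|
      rule_action == action &&
        resource.is_a?(resource_class) &&
        conditions.all? { |attr, value| resource.public_send(attr) == value }
    end
  end

  # Returns the resource when allowed, raises otherwise.
  def authorize(action, resource)
    raise AccessDenied, "not allowed to #{action}" unless can?(action, resource)

    resource
  end
end

ability = AbilitySketch.new([:read_payrolls])
ability.subject(Payroll, company_id: 42) do
  can [:read], with: :read_payrolls
  can [:create, :update, :destroy], with: :manage_payrolls
end

payroll = Payroll.new(42)
ability.authorize(:read, payroll)    # returns payroll
ability.can?(:update, payroll)       # false
ability.can?(:read, Payroll.new(7))  # false (different company)
```

The controller only calls `authorize`; which user-facing permission led to the rule being registered is invisible at that point.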

These simple abstractions made the code more manageable and separated the business logic of the toggleable permissions from individual resource endpoints. In the example above, reading payroll data is needed across permissions. Other resources we expect to always be bundled under one permission. For instance, we expect the data that enables our employee time off features to always be managed under a single permission. With this abstraction, we have the flexibility to split permissions into chunks as large or small as needed.

If we evaluate this solution against our original framework, the expected changes outlined earlier have become easier to make. In fact, since the project went live, we’ve already seen changing product requirements (splitting a permission in two), and the team addressed this easily because of this design.
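To illustrate why a split is cheap under this design, the sketch below divides one user-facing permission into two by editing only the map. The particular split (`:run_payroll` and `:view_time_off`) is invented for the example and is not the split we actually made.

```ruby
# Before: one broad toggleable permission.
MAP_BEFORE = {
  pay_people: [:read_payrolls, :manage_payrolls, :read_time_off],
}.freeze

# After: a hypothetical split into two narrower toggleable permissions.
MAP_AFTER = {
  run_payroll: [:read_payrolls, :manage_payrolls],
  view_time_off: [:read_time_off],
}.freeze

# Endpoint-level check, written once against a granular permission.
# It is identical before and after the split.
def can_read_payrolls?(toggleable_permissions, map)
  toggleable_permissions.any? { |p| map.fetch(p, []).include?(:read_payrolls) }
end

can_read_payrolls?([:pay_people], MAP_BEFORE)    # true
can_read_payrolls?([:run_payroll], MAP_AFTER)    # true
can_read_payrolls?([:view_time_off], MAP_AFTER)  # false
```

No endpoint or endpoint test changes here; the only decision is which granular permissions each new user-facing permission should carry.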

Choose the right rollout plan

A long-running project is challenging not only to plan, but also to release. At Gusto, it is very common to put our in-progress features behind a feature flag. In our case, since the new permission system would result in an extensive number of changes, it would be a disaster to put all of our changes behind a feature flag, as inactive code paths are in danger of becoming stale and broken.

Our existing admins had full access to the app, and we wanted to migrate them to have the Full Access permission after our rollout. Full Access would be implemented by granting every granular permission described in the previous section, making it functionally the same as being an admin in the existing app.

Given this requirement, we had the idea to migrate all of the existing admins to the new permissioning framework immediately, and give them the Full Access permission. For all unmigrated endpoints, admins would still have access to them just as before. For all newly permissioned endpoints, existing admins would have access by having the Full Access permission. This meant the changes to authorization logic would go live as soon as we deployed them, thus curtailing the amount of stale code added.

Even if you don’t have a Full Access permission as we did, you can still create an ephemeral Full Access permission that exists while your permissioning project is in-flight.
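A rough sketch of this idea follows, assuming Full Access is derived from the permission map rather than listed by hand. The map contents, the `:full_access` key, and the in-memory admin records are illustrative only.

```ruby
PERMISSION_MAP = {
  pay_people: [:read_payrolls, :manage_payrolls, :read_time_off],
  view_financial_reports: [:read_payrolls, :read_accounting_integrations],
}.freeze

# Deriving Full Access from the map means it cannot fall behind when a
# granular permission is added to any other entry. This assumes every
# granular permission appears under at least one user-facing permission.
ALL_GRANULAR_PERMISSIONS = PERMISSION_MAP.values.flatten.uniq.freeze
FULL_PERMISSION_MAP = PERMISSION_MAP.merge(full_access: ALL_GRANULAR_PERMISSIONS).freeze

def granular_permissions_for(toggleable_permissions)
  toggleable_permissions.flat_map { |p| FULL_PERMISSION_MAP.fetch(p, []) }.uniq
end

# Hypothetical one-off migration: every existing admin receives :full_access,
# so newly permissioned endpoints behave exactly as they did before.
existing_admins = [{ id: 1, permissions: [] }, { id: 2, permissions: [] }]
migrated = existing_admins.map { |admin| admin.merge(permissions: [:full_access]) }
```

With every existing admin holding Full Access, each newly permissioned endpoint can ship immediately and behave the same for them as it did before the migration.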

With this implementation, the only part we feature flagged was the ability to edit an admin’s permissions.

Make it hard to do the wrong thing

This may be the most important learning. Any code that deals with authorization is a matter of security, and our customers trust us to secure their data appropriately. When you transition from a world where admins have access to everything to a system with field-level authorization, you want to make sure that even if an engineer has missed all of your PSAs about the upcoming changes, they aren’t putting your customers at risk.
Who would ever possibly ignore a warning?

The plan was for our team to build the initial version of the project, but all new features would need to be permissioned by their respective teams.

One question we asked when we tackled each part of the architecture was what could happen if someone forgot about permissions.

If your engineering organization is as large as ours, this isn’t just likely to happen, it’s an inevitability. We decided to be cautious around default behaviors, log warnings as a safety guard, and write convenient test helpers to help surface when something goes wrong.

Be cautious around default behaviors

When logging in to Gusto, admins land on their dashboard, which shows a list of to-do items that the admin should address. Since these would only be actionable to admins that have the appropriate permission, we needed to figure out how to hide these to-do items conditionally. We decided to make each to-do item hidden by default, meaning that if it’s not explicitly mapped to a required permission, it will be hidden from the user. As a result, it should be pretty obvious to a feature developer that something is wrong if they forget to consider permissions, as any level of manual or automated testing should fail. By hiding to-do items by default, we mitigate the risk of exposing sensitive information to the wrong users in case of engineering error.

Logging of warnings

Even with the default described above, a to-do item can still be added without a required permission. As a safeguard, we log a warning that gets surfaced in our monitored errors whenever that happens. This way, we have an extra layer of protection against a developer having done the wrong thing.
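The hide-by-default rule and the warning can be sketched together as follows. The `TodoItem` struct, the `visible_todos` method, and the use of Ruby’s standard `Logger` are assumptions for illustration; our real monitoring setup is not shown here.

```ruby
require "logger"
require "stringio"

# Hypothetical to-do item; required_permission is nil when nobody mapped it.
TodoItem = Struct.new(:title, :required_permission)

# Returns only the to-do items the admin may see. Items with no required
# permission are hidden and reported, never shown.
def visible_todos(todo_items, granular_permissions, logger)
  todo_items.select do |item|
    if item.required_permission.nil?
      logger.warn("To-do item #{item.title.inspect} has no required permission; hiding it")
      false
    else
      granular_permissions.include?(item.required_permission)
    end
  end
end

log_output = StringIO.new
logger = Logger.new(log_output)
todos = [
  TodoItem.new("Run payroll", :manage_payrolls),
  TodoItem.new("Review time off", :read_time_off),
  TodoItem.new("Brand-new feature", nil), # developer forgot the mapping
]

shown = visible_todos(todos, [:manage_payrolls], logger)
shown.map(&:title) # ["Run payroll"]
```

The unmapped item fails closed: it never reaches the dashboard, and the warning gives the owning team a signal to add the missing mapping.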

Convenient test helpers

We created test helpers that could be dropped into API tests to validate access to the endpoint based on whether or not you have the required granular permission. In creating them, we wanted to make them as easy to use as possible. One of the controller helpers looked something like this:

it_behaves_like 'a read action that requires permission', Permissions::MANAGE_BENEFITS

To invoke this test helper, you specify the required permission for the endpoint. Under the hood, the test verifies that the endpoint is accessible with the permission, and is not accessible when the user has all other permissions except the one that’s required. The easier it is to test, the more likely developers are to write tests.
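The real helper is an RSpec shared example, but the check underneath can be sketched in plain Ruby so that it runs without RSpec. Everything here is hypothetical: the endpoints are lambdas that return an HTTP-like status, and `permission_check_failures` is an invented name.

```ruby
ALL_PERMISSIONS = [:read_payrolls, :manage_payrolls, :manage_benefits].freeze

# Hypothetical endpoint: returns a status for a set of granular permissions.
# 401 mirrors the status used in the examples in this post.
BENEFITS_ENDPOINT = lambda do |granular_permissions|
  granular_permissions.include?(:manage_benefits) ? 200 : 401
end

# An endpoint whose author forgot about permissions entirely.
FORGOTTEN_ENDPOINT = ->(_granular_permissions) { 200 }

# Returns a list of failure messages; empty means the endpoint is
# correctly guarded by the required permission.
def permission_check_failures(endpoint, required_permission, all_permissions: ALL_PERMISSIONS)
  failures = []

  # Case 1: holding only the required permission must be enough.
  status = endpoint.call([required_permission])
  failures << "expected access with #{required_permission}, got #{status}" unless status == 200

  # Case 2: holding every other permission must not be enough.
  status = endpoint.call(all_permissions - [required_permission])
  failures << "expected denial without #{required_permission}, got #{status}" if status == 200

  failures
end

permission_check_failures(BENEFITS_ENDPOINT, :manage_benefits)       # []
permission_check_failures(FORGOTTEN_ENDPOINT, :manage_benefits).size # 1
```

Testing against “every permission except the required one” is what catches an endpoint guarded by the wrong permission, not just an unguarded one.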

Conclusion

Living in the intersection of software framework design and security, this project presented interesting and unique challenges. It was exciting, and it taught us many lessons along the way. This was a large project in terms of time, resources, and the area of code that we touched. However, all of these learnings fell in line with an overarching principle, and one of Gusto’s company values, “Don’t optimize for the short term”.

My team and I would love to hear about your experiences executing similar types of projects that live in this intersection of software and security. Feel free to reach out to me on Twitter.


Special thanks to Upeka Bee and Noa Elad for their feedback on earlier drafts of this post.