Protecting sensitive data at Gusto with HAPII

Gusto is a custodian of some of your most sensitive personal and financial information. We take this responsibility seriously, and are constantly reviewing our security posture and data handling practices for ways to improve well beyond the industry standard.

Most software systems operate on the basis of perimeter security - if you can prove your identity and need for legitimate access at the door, you are allowed in. Once inside, however, you are essentially trusted to carry out just the tasks you are authorized to do. In the physical world, you would have an access badge with verifiable information that states what you’re allowed to do, and a combination of random checks (security guards walking around) and systematic checks (at doors and checkpoints) are used to try and prevent you from doing anything unexpected.

For the most sensitive information that organizations handle, there is often an entirely separate level of access and monitoring - a different colored badge, biometrics, manned checkpoints, steel vaults, and explicit entry and exit logs. The additional complexity and cost of these procedures are justified by the sensitivity of the data, and the magnitude of the impact if this data falls into the wrong hands.

Over the last year, Gusto has built the digital equivalent of these extended procedures and protections for the most critical data we handle, like SSNs and bank account numbers. We call this system HAPII - the Hardened PII store, though in principle this system can be used to store virtually any data. This service acts as our isolated, secure vault for information. In this first part of a two-part series, we will outline the design of this system, and how we have leveraged it to improve data handling practices in Gusto’s systems. In the second part, we will discuss the architectural and implementation details of HAPII, and how these provide secure and efficient access to sensitive data as needed.

Technical Challenges

Gusto is currently implemented as a Rails monolith - we have two large Rails applications that handle payroll- and benefits-related processes. Each is responsible for its own data storage and handling, with cross-communication for shared data models primarily directed by the payroll-related system, which existed first.

Behind these systems are large relational data stores, and object storage for items like generated tax documents and invoices. As is standard practice, these databases are encrypted at rest using keys that Gusto controls. Access to the database is only possible within our Virtual Private Cloud (VPC) networks, with strong credentials for database connections, and auditing of data queries.

Within our Rails applications, we utilize ActiveRecord to interact with the database. For data considered to be particularly sensitive, we utilize column-level encryption using a different key from the base at-rest encryption provided for the whole database.

These controls provide a sufficient degree of security for most forms of data. However, there are several aspects that we felt could be improved:

In a monolithic application, access to data by code is uniform, and observable only if auditing code is added specifically for this purpose. Field-level encrypted data is typically decrypted on first use, which requires broad availability of the decryption key throughout the environment. ActiveRecord result sets are processed and forwarded through multiple code paths, often exposing data to more code modules than is strictly necessary.
Access to data in the store is uniform - there are no per-module restrictions within the monolith to request data from specific tables, at least at the level of the credentials used to establish database connections and dispatch queries.
In many cases, only a redacted form of sensitive data is actually needed - for example, often the last four digits of a social security number (SSN), or of a bank account number, are needed for a value to be recognizable while obfuscated. A common approach for redaction is to retrieve the unredacted form from the database and then redact in-memory - this still leaves a narrow window of opportunity for misuse of the unredacted form.
Sensitive data is encrypted at the column level, but not in a way that allows easy, automatic, and frequent rotation of encryption keys. This is a common operational security problem across the industry, where the encryption parameters for each stored value are implicit rather than explicitly retained as metadata.
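
To illustrate the last point, here is a minimal sketch of what retaining encryption parameters as explicit metadata might look like. It is not HAPII's cryptographic scheme (that is left to the second part of this series). The envelope layout, the key IDs, and the helper names encrypt_with_metadata, decrypt_with_metadata and rotate are all assumptions made for illustration. Only Ruby's standard OpenSSL bindings are used.

```ruby
require "openssl"

# Encrypts a value and returns an envelope that records HOW it was encrypted.
# Storing key_id and algorithm next to the ciphertext is what makes later
# rotation mechanical. (Hypothetical layout, for illustration only.)
def encrypt_with_metadata(plaintext, key_id, keys)
  cipher = OpenSSL::Cipher.new("aes-256-gcm")
  cipher.encrypt
  cipher.key = keys.fetch(key_id)
  iv = cipher.random_iv
  cipher.auth_data = key_id # bind the key ID to the ciphertext
  ciphertext = cipher.update(plaintext) + cipher.final
  {
    key_id: key_id,
    algorithm: "aes-256-gcm",
    iv: iv,
    auth_tag: cipher.auth_tag,
    ciphertext: ciphertext
  }
end

# Decrypts using only the parameters named in the envelope itself.
def decrypt_with_metadata(envelope, keys)
  cipher = OpenSSL::Cipher.new(envelope[:algorithm])
  cipher.decrypt
  cipher.key = keys.fetch(envelope[:key_id])
  cipher.iv = envelope[:iv]
  cipher.auth_tag = envelope[:auth_tag]
  cipher.auth_data = envelope[:key_id]
  cipher.update(envelope[:ciphertext]) + cipher.final
end

# Rotation: decrypt under whatever key the envelope names, re-encrypt under the new one.
def rotate(envelope, new_key_id, keys)
  encrypt_with_metadata(decrypt_with_metadata(envelope, keys), new_key_id, keys)
end

# Hypothetical key ring with an old and a new key.
keys = {
  "2023-01" => OpenSSL::Random.random_bytes(32),
  "2024-01" => OpenSSL::Random.random_bytes(32)
}
envelope = encrypt_with_metadata("123-45-6789", "2023-01", keys)
rotated = rotate(envelope, "2024-01", keys)
puts rotated[:key_id]
```

Because each stored value names its own key, a rotation job can work through values incrementally, and old and new keys can coexist while it runs.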

A desire to address these issues resulted in the creation of the HAPII service and related infrastructure. We are in the process of migrating the most sensitive data Gusto holds to this system.

The HAPII service

HAPII is essentially a specialized type of encrypted key-value store, which operates in an environment that is comprehensively isolated from our existing infrastructure. HAPII’s goal is to securely store and handle data structures we simply refer to as objects, which have the following attributes:

id: A cryptographically-random 128-bit identifier. HAPII is responsible for generating these identifiers when objects are stored, and clients retain these identifiers to fetch data in the future. IDs are not derived from the data stored in an object in any way - there is no way to determine anything useful about an object from its ID, or vice versa.
type: A short human-readable type name which is used to define some expectations about the data stored for particular object classes, and for some basic type-based access control, via configuration. For social security numbers, the short type name is simply “ssn”.
full_value: The full security- or privacy-sensitive value. We must protect this as carefully as we can. For an SSN, this is simply the canonical string representation of the SSN. For more complex data, this may be a more deeply nested structure with multiple fields.
redacted_value: A redacted form of the full value. In order for HAPII to avoid embedding domain-specific knowledge of the data it stores, we expect clients to perform their own redaction of the full value prior to writing the object. When retrieving objects later, clients can request just the redacted form - in fact, this is the default. For SSNs, the redacted form is the typical “last four digits” representation.
search_value: An optional string, derived from the full value, that supports exact-match search. SSN search is permitted for exact match of the SSN in its full form, i.e. “123-45-6789”, while for values like postal addresses, we may perform a search based on the exact zip code: “94107-4345”.
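
The attributes above can be sketched as a simple Ruby structure. This is an illustrative assumption rather than HAPII's actual client code: the HapiiObject struct, the redact_ssn and build_ssn_object helpers, and the masked display format are hypothetical, and in the real system the ID is generated by HAPII itself rather than by the client.

```ruby
require "securerandom"

# Hypothetical in-memory shape of a HAPII object, mirroring the attributes above.
HapiiObject = Struct.new(
  :id, :type, :full_value, :redacted_value, :search_value,
  keyword_init: true
)

# Client-side redaction: the client, not the store, knows what an SSN looks like.
def redact_ssn(ssn)
  digits = ssn.delete("^0-9")
  raise ArgumentError, "expected 9 digits" unless digits.length == 9
  "***-**-#{digits[-4, 4]}"
end

# Stand-in for server-side ID generation: 16 random bytes (128 bits), hex-encoded.
# The ID is independent of the stored value.
def generate_id
  SecureRandom.hex(16)
end

def build_ssn_object(ssn)
  HapiiObject.new(
    id: generate_id,
    type: "ssn",
    full_value: ssn,
    redacted_value: redact_ssn(ssn),
    search_value: ssn # exact-match search on the full form
  )
end

object = build_ssn_object("123-45-6789")
puts object.redacted_value
```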

Objects stored in HAPII are immutable - once written, they cannot be changed, though they can be marked for deletion. We utilize different purge strategies for deleting data of different types, as certain types must be retained for at least 18 months (or in some cases, years or decades) due to legal requirements around book-keeping of financial information.
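
As a rough sketch of these semantics (write once, read redacted by default, mark for deletion rather than update), consider the in-memory stand-in below. The InMemoryVault class and its method names are hypothetical. The real service is reached over gRPC and stores data encrypted, and its purge strategies are not modeled here.

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for the store's semantics, for illustration only.
class InMemoryVault
  Record = Struct.new(:type, :full_value, :redacted_value, :search_value, :deleted)

  def initialize
    @records = {}
  end

  # Stores a new object and returns its freshly generated 128-bit ID.
  # There is deliberately no update method: objects are write-once.
  def put(type:, full_value:, redacted_value:, search_value: nil)
    id = SecureRandom.hex(16)
    @records[id] = Record.new(
      type,
      full_value.dup.freeze,
      redacted_value.dup.freeze,
      search_value&.dup&.freeze,
      false
    )
    id
  end

  # Returns the redacted form unless the full value is explicitly requested.
  def get(id, full: false)
    record = @records.fetch(id) { raise KeyError, "unknown object" }
    raise KeyError, "object is marked for deletion" if record.deleted
    full ? record.full_value : record.redacted_value
  end

  # Exact-match search on search_value within a single type; returns matching IDs.
  def search(type:, search_value:)
    @records.select do |_id, r|
      !r.deleted && r.type == type && r.search_value == search_value
    end.keys
  end

  # Marks an object for deletion; a separate purge process would remove it later.
  def mark_for_deletion(id)
    @records.fetch(id).deleted = true
  end
end

vault = InMemoryVault.new
id = vault.put(type: "ssn", full_value: "123-45-6789",
               redacted_value: "***-**-6789", search_value: "123-45-6789")
puts vault.get(id)             # redacted by default
puts vault.get(id, full: true) # full value only on explicit request
```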

As HAPII provides relatively low-level and fine-grained data, we were sensitive to the performance implications of externalizing this type of data from the existing data store. As such, we set tight latency goals for all of HAPII’s RPCs, with sub-10ms latencies expected at the 99th percentile for most operations. For a low-level service such as HAPII, latency incurred from interacting with it contributes to the minimum latency for any higher-level operation. If multiple requests are made to HAPII, that minimum latency can compound quickly.
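
To make the compounding concrete, here is a back-of-the-envelope calculation. It assumes requests are issued sequentially and that their latencies are independent. Both are simplifying assumptions for illustration, and the helper names are hypothetical.

```ruby
P99_BOUND_MS = 10.0 # the per-call 99th percentile target mentioned above

# Rough latency budget if every one of `calls` sequential requests lands at the bound.
def sequential_budget_ms(calls, per_call_ms = P99_BOUND_MS)
  calls * per_call_ms
end

# Probability that at least one of `calls` independent requests exceeds its p99.
def chance_of_slow_call(calls)
  1.0 - 0.99**calls
end

[1, 5, 10].each do |n|
  puts format("%2d calls: budget %.0f ms, P(at least one slow call) = %.1f%%",
              n, sequential_budget_ms(n), chance_of_slow_call(n) * 100)
end
```

Under these assumptions, an operation that makes ten sequential calls has roughly a 9.6% chance of hitting at least one call slower than the p99, which is why reducing the number of calls matters as much as making each one fast.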

Changing data usage patterns

HAPII’s service interface is deliberately simple. Clients must explicitly ask for data using gRPC client stubs, which we have deliberately not integrated into the more typical tools that Gusto engineers use to retrieve data, like ActiveRecord or GraphQL. As a consequence, sensitive data access is even more explicit, which introduces opportunities to implicitly prompt developers with two important questions:

Do you really need access to this data in this context?
Is a redacted value sufficient, or is the full value needed?

Combined with increased observability of when data is retrieved from HAPII, unblended from SQL request logs or GraphQL requests with complex joins and filters, we have been able to monitor where data is retrieved. Through simple inspection of these data usages, we have asked more direct questions of our engineering teams to reinforce the importance of intent and underscore the need to keep processes and patterns secure - avoiding data retrieval, favoring redacted forms over full forms, and isolating usage to as small a section of the code as possible. A useful side-effect of these changes is that they also make the product faster - the fastest query is one you never have to make!
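
One way to picture this kind of observability is a thin wrapper that records every read along with the form requested. The sketch below is an assumption for illustration only: AuditedVaultClient, StubVault, and the purpose label are hypothetical names, not HAPII's actual audit mechanism.

```ruby
# Minimal stub standing in for the real service (hypothetical).
class StubVault
  def initialize(objects)
    @objects = objects
  end

  def get(id, full: false)
    object = @objects.fetch(id)
    full ? object[:full] : object[:redacted]
  end
end

# Wraps a vault and records who asked for what, and in which form.
class AuditedVaultClient
  attr_reader :audit_log

  def initialize(vault)
    @vault = vault
    @audit_log = []
  end

  # Callers must state a purpose; the redacted form remains the default.
  def get(id, purpose:, full: false)
    @audit_log << { id: id, form: (full ? :full : :redacted), purpose: purpose }
    @vault.get(id, full: full)
  end

  # Summarizes how often each form was requested.
  def reads_by_form
    @audit_log.each_with_object(Hash.new(0)) { |entry, counts| counts[entry[:form]] += 1 }
  end
end

client = AuditedVaultClient.new(
  StubVault.new("abc" => { full: "123-45-6789", redacted: "***-**-6789" })
)
client.get("abc", purpose: "profile page")
client.get("abc", purpose: "tax filing", full: true)
p client.reads_by_form
```

A log like this keeps sensitive reads separate from general query logs, so questions such as "which code paths request full values?" become simple aggregations.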

By inspecting where SSNs were retrieved, we discovered multiple instances where the SSN was simply never used - it was being retrieved implicitly as part of a larger bundle of data retrieval, passed around as part of that bundle, but ultimately never read. These types of SSN retrievals could be replaced with a lazy retrieval, where the data is not actually fetched until the associated property is explicitly read - the “getter” for the property sends the read request to HAPII, rather than this value being pre-populated when the containing object is constructed.
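
A lazy getter of this kind can be sketched as follows. The Employee class, the CountingVault stub, and their method names are hypothetical, chosen only to show that no read is issued until the property is actually accessed.

```ruby
# Stub vault that counts reads, standing in for the real service (hypothetical).
class CountingVault
  attr_reader :reads

  def initialize(values)
    @values = values
    @reads = 0
  end

  def get(id)
    @reads += 1
    @values.fetch(id)
  end
end

class Employee
  attr_reader :name

  # Only the opaque object ID is held at construction time; no SSN is fetched.
  def initialize(name:, ssn_id:, vault:)
    @name = name
    @ssn_id = ssn_id
    @vault = vault
  end

  # The read request is sent on first access and memoized afterwards.
  def redacted_ssn
    @redacted_ssn ||= @vault.get(@ssn_id)
  end
end

vault = CountingVault.new("abc123" => "***-**-6789")
employee = Employee.new(name: "Ada", ssn_id: "abc123", vault: vault)
puts vault.reads           # no reads yet: constructing the object fetched nothing
puts employee.redacted_ssn # first access triggers the read
puts vault.reads
```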

In other instances, we discovered opportunities to change front-end display patterns to show the redacted form of the SSN, with a “reveal” action to show the full SSN if required. This makes the display of SSNs within front-end forms and reports more intentional.

We found that displaying the redacted form with a reveal button was highly effective. In 98% of cases where the redacted value was shown, the full value was never subsequently requested. We were therefore able to avoid retrieving and transporting the full SSN to the frontend in the vast majority of cases, leaving the sensitive data safely at rest in HAPII.

With improved observability of data retrieval and usage, a data-driven case can be made for change, and the resulting reduction in risk can be expressed concretely. HAPII enables us to learn even more about what data usage patterns provide the best balance between security and usability for our product teams, and we are leveraging this experience to write a “playbook” that can better guide our teams on data handling practices.

Conclusion

HAPII’s design provides an even more secure and robust data storage mechanism for our customers’ most sensitive data. Combined with a detailed audit log of usage and observability into where data is used, we have been able to make substantial changes to how this data is used. Adoption of HAPII across multiple sensitive data types has resulted in a significant improvement in loss prevention.

In the next part, we will discuss the implementation details behind HAPII - the architecture and implementation strategies used, the cryptographic scheme for storing the sensitive data, and how these helped support the goals of the project. Until next time!
