Masking data is an essential part of modern privacy engineering. We highlight a handful of masking strategies made possible with the Fides open-source platform, and we explain the difference between key terms: pseudonymization and anonymization.
Pseudonymization and anonymization might seem like interchangeable terms—they both involve replacing data with other values. But understanding how to implement these distinct practices is key to your company’s privacy responsibilities. Conflating the two terms has contributed to one of the most consequential decisions in global privacy thus far in 2022. Whether your company is looking for peace of mind in processing data internationally or domestically, knowing these practices helps you better assess and mitigate privacy risk.
In short, pseudonymized data can be restored to its intelligible form by the party that performed the pseudonymization; whereas anonymized data cannot be de-anonymized by anyone, including the anonymizing party. Thus, anonymization places stricter limits on identifiability. The General Data Protection Regulation (GDPR) considers anonymous data to be information that cannot identify a unique individual by any reasonably likely method, including the use of outside sources to pinpoint the person associated with a record.
In casual conversation, someone might describe a survey form as “anonymous” because it doesn’t collect respondents’ names. Yet the form might collect other identifying information such as gender, date of birth, and postal code—a combination in itself that Dr. Latanya Sweeney found to uniquely identify the vast majority of Americans. The moral of the story is this: be intentional with your terminology. True anonymization is a strict and difficult standard to achieve.
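To make the point concrete, here is a minimal sketch of how one might check whether a "nameless" dataset still singles people out. The records, field names, and the `unique_combinations` helper are all hypothetical and exist only for illustration; they are not part of Fides.

```python
from collections import Counter

# Hypothetical survey rows: no names collected, only quasi-identifiers.
responses = [
    {"gender": "F", "date_of_birth": "1985-03-14", "postal_code": "02139"},
    {"gender": "M", "date_of_birth": "1990-07-02", "postal_code": "02139"},
    {"gender": "F", "date_of_birth": "1985-03-14", "postal_code": "02139"},
    {"gender": "M", "date_of_birth": "1978-11-30", "postal_code": "94110"},
]

QUASI_IDENTIFIERS = ("gender", "date_of_birth", "postal_code")


def unique_combinations(rows, fields=QUASI_IDENTIFIERS):
    """Return the quasi-identifier combinations that appear exactly once."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


singled_out = unique_combinations(responses)
print(f"{len(singled_out)} of {len(responses)} responses are unique on {QUASI_IDENTIFIERS}")
```

Any combination that appears only once points to a single respondent, so anyone holding an outside source with the same three fields could re-identify that row.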
We use the term data masking to cover both pseudonymization and anonymization. There are important business cases for both practices, and this blog post will explain how the Fides platform enables developers to meet nuanced legal needs in controlling data identifiability.
As a data engineer, you might be asked to fulfill a California resident’s erasure request. Or you are tasked with obfuscating purchase data that is no longer necessary. Both of these situations call for data masking: there is recognizable personal data that should be rendered unrecognizable.
At first glance, you might want to anonymize all personal data in your systems. Anonymization renders data the least identifiable and, theoretically, it minimizes end-users’ privacy risk and your company’s risk of non-compliance; anonymized data is notably exempt from GDPR requirements. You might even think of running a hard delete on all of the personal data that you are not currently using. While principles of data minimization and retention are important responsibilities, there are legitimate business reasons to consider other approaches that retain a greater degree of data identifiability.
A hard delete could spell trouble and confusion for your databases. For instance, suppose there is an orders table that is linked to a users table. Deleting an entry in the users table could leave “orphan records” in the orders table, where the previously defined relationship—aka referential integrity—has been broken. This can contribute to technical privacy debt when data engineers revisit the orders table and must backtrack to figure out where this order data came from.
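The orphan-record problem can be reproduced in a few lines. This is a minimal sketch using Python's built-in `sqlite3` module; the table layout and values are hypothetical, and no foreign-key constraint is declared, so the database does nothing to stop the delete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'jane@example.com');
    INSERT INTO orders VALUES (100, 1, 42.50);
    """
)

# Hard delete of the user row; the order that points at it is left behind.
conn.execute("DELETE FROM users WHERE id = 1")

# Orphans are orders whose user_id no longer matches any users row.
orphans = conn.execute(
    """
    SELECT o.id
    FROM orders AS o
    LEFT JOIN users AS u ON u.id = o.user_id
    WHERE u.id IS NULL
    """
).fetchall()
print(orphans)
```

Masking the `email` column in place, rather than deleting the row, would keep the `users.id` value that `orders.user_id` depends on.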
On the legal side, certain data should not be deleted in order to meet requirements such as tax documentation. In addition to accounting for purchase data, the data itself might be legally subject to a minimum retention window, which could extend for a year or more depending on the type of data and the legal jurisdiction.
Simultaneously balancing legal, technical, and ethical responsibilities is a hallmark of privacy engineering. Having an array of masking strategies can enable privacy engineers to be prepared for a variety of circumstances. Fides Ops equips developers with open-source tools for customizable masking strategies.
There are a variety of methods to mask data in Fides Ops. Understanding the basics of configuring a general masking rule in Fides Ops opens the door to a variety of specific applications to suit business needs.
In general, a masking rule in Fides Ops involves configuring a couple of straightforward settings. To implement these methods yourself, check out this guide for developers. In this section, we describe the method and motivation for four masking approaches that offer various degrees of identifiability.
If your defined database relationships and business obligations permit, it might be appropriate to simply replace data with a null value.
For example, the data firstname.lastname@example.org would be output as a null value.
It’s important to exercise caution when implementing a null rewrite. Before taking this approach, confirm that the null rewrite would not break a downstream data operation that counts on a nonempty or specially formatted entry in the impacted database. When it comes to legal needs, make sure that there are no policy or regulatory requirements to retain the data.
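As a rough illustration of both the technique and the caution above, here is a sketch in plain Python. The `null_rewrite` and `mask_fields` helpers are hypothetical names for this post, not the Fides Ops API.

```python
def null_rewrite(value):
    """Replace any input with None (stored as NULL in most databases)."""
    return None


def mask_fields(record, fields, strategy):
    """Return a copy of `record` with `strategy` applied to each named field."""
    return {k: (strategy(v) if k in fields else v) for k, v in record.items()}


user = {"id": 1, "email": "jane@example.com", "plan": "pro"}
masked = mask_fields(user, {"email"}, null_rewrite)
print(masked)

# A downstream step that assumes a non-empty, email-shaped string breaks on None.
try:
    domain = masked["email"].split("@")[1]
except AttributeError:
    domain = None  # this consumer would need explicit handling for masked rows
```

The `try`/`except` at the end stands in for any downstream job that assumes the field is populated; those are the consumers worth auditing before enabling a null rewrite.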
To replace data with a fixed string such as MASKED, a default string value would be the ideal approach. This tactic removes personal data, but unlike the null rewrite method, it can preserve custom formatting needs.
For instance, the data email@example.com might be output as MASKED.
With Fides Ops, a masking strategy can include a custom suffix, such as @masked.com, to maintain formatting requirements for downstream data operations that might require data to be formatted as email addresses. With the random string value method and the more sophisticated encryption techniques up next, this custom suffix can also be appended to the encrypted output.
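A fixed-string rewrite with an optional suffix can be sketched as follows. The function name and its parameters are assumptions made for illustration and do not reflect how Fides Ops names its settings.

```python
def string_rewrite(value, rewrite_value="MASKED", suffix=""):
    """Replace any input with a fixed string, optionally followed by a suffix."""
    return f"{rewrite_value}{suffix}"


plain = string_rewrite("jane@example.com")
email_shaped = string_rewrite("jane@example.com", suffix="@masked.com")

print(plain)         # MASKED
print(email_shaped)  # MASKED@masked.com
```

The suffixed form still looks like an email address to a simple format check, which is the property the custom suffix is meant to preserve.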
Instead of a fixed string such as MASKED, you might wish to replace the personal data with a random string of a fixed length.
As an example of this masking strategy, the email address firstname.lastname@example.org might be masked as a string of random characters of the configured length.
This method is, by definition, non-deterministic. Providing the same input repeatedly will lead to different outputs each time. This property is vital to consider before implementing: two separate instances of the same user’s email address will be replaced by non-identical random strings, so the masked values can no longer be matched to each other.
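The non-determinism is easy to see in a small sketch built on Python's `secrets` module. The helper name, the default length of 30, and the alphabet are illustrative choices, not Fides Ops defaults.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def random_string_rewrite(value, length=30, suffix=""):
    """Replace any input with `length` random characters plus an optional suffix."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length)) + suffix


first = random_string_rewrite("jane@example.com")
second = random_string_rewrite("jane@example.com")

print(first)
print(second)
print(first == second)  # same input, yet the outputs almost certainly differ
```

Because the input is ignored entirely, two tables that both held the same address end up with unrelated values, and a join across them on the masked column no longer matches.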
When masking data with the Advanced Encryption Standard (AES) or a Hash-based Message Authentication Code (HMAC), distinct inputs produce distinct outputs (for HMAC, a keyed hash rather than an encryption scheme, this holds up to a negligible chance of collision). So any two pieces of distinct data will have distinct masked forms. But the same input will produce the same output, within the context of the same privacy request. This notion is called deterministic pseudonymization. One of the powerful applications of deterministic pseudonymization is that it upholds the referential integrity in the impacted databases, so useful business information persists for analytics and dashboards, while the data is no longer in plaintext. Below, we have an example using HMAC.
The email address email@example.com could be masked as a fixed-length digest that bears no visible relation to the original address.
With Fides Ops, the specific hash function for HMAC can be set to a variety of options, such as SHA-256 and SHA-512. Currently, Fides Ops implements AES using the GCM mode of encryption by default.
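Deterministic pseudonymization with HMAC can be sketched using only the Python standard library. The `hmac_mask` helper and the idea of generating one random key per privacy request are assumptions made for this illustration; they are not a description of how Fides Ops manages keys internally.

```python
import hmac
import secrets


def hmac_mask(value, key, algorithm="sha256", suffix=""):
    """Return the hex HMAC of `value` under `key`, plus an optional suffix."""
    digest = hmac.new(key, value.encode("utf-8"), digestmod=algorithm).hexdigest()
    return digest + suffix


# Assumption for this sketch: one secret key per privacy request.
request_key = secrets.token_bytes(32)

users_email = hmac_mask("jane@example.com", request_key)
orders_email = hmac_mask("jane@example.com", request_key)
other_email = hmac_mask("john@example.com", request_key)

print(users_email == orders_email)  # True: same input and key, same output
print(users_email == other_email)   # False: a different input gives a different output

# Switching the hash function changes the digest length (64 vs. 128 hex characters).
longer = hmac_mask("jane@example.com", request_key, algorithm="sha512")
print(len(users_email), len(longer))
```

Because both tables receive the same masked value for the same address, a join on the masked column still lines up, which is the referential-integrity benefit described above.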
Different masking techniques suit different conditions, both in terms of technical capacity and legal obligations. In addition to knowing a variety of masking strategies, listening to legal and engineering teams’ input prior to implementation is key to effective masking strategy—and effective privacy operations in general.
One of the most useful applications of custom masking strategies is in fulfilling a user’s erasure request, which is a type of data subject request (DSR). DSRs are user requests to access, modify, or erase the personal data that a company holds on them. They are a pillar of global privacy, and effectively fulfilling DSRs is one of the most visible user-facing ways to demonstrate your commitment to respecting users’ privacy.
When it comes to fulfilling DSRs, Fides Ops not only enables custom masking strategies; it also supports the automation of DSR workflows throughout your infrastructure. Alongside fulfilling access requests, Fides Ops automatically masks data when processing an erasure request—all in accordance with the specific conditions and settings that your organization configures. Your company can configure Fides Ops to ensure that masking strategies efficiently satisfy erasure requests while meeting the legal and technical requirements on your infrastructure.
Rendering all of your company’s personal data fully anonymous is rarely feasible from a technical or business standpoint. It is virtually inevitable that a modern business will collect, process, and retain personal data in a form that is more identifiable than strictly anonymous data is.
When data is more identifiable—that is, further from anonymity—it carries greater risk; data falling into the wrong hands or processed without legal justification has a more severe impact if that data can be associated with an individual. As a result, your company will need to meet more governance requirements with this data. For instance, a policy might require your company’s AI systems to only be trained on customer data that has been anonymized. In addition to orchestrating data masking, the Fides platform also offers open-source tools to support these granular privacy design choices in software development. This approachable tutorial describes how to systematically label data identifiability and other privacy characteristics directly in the codebase, enabling powerful applications like automatic privacy checks whenever new code is submitted to a software project.
Globally, privacy regulations and authorities are elevating their requirements for companies to only process personal data for legitimate purposes and legitimate timespans. Reckless data collection, murky conditions for data identifiability, and undefined data retention—these behaviors increasingly attract fines, investigations, and reputational damage.
Beyond avoiding bad practices, nuanced masking strategies are an opportunity to cultivate long-lasting relationships with your users and demonstrate that you are worthy of their trust.
Get started by cloning the Fides Ops repo, the free and open-source tool for developers to orchestrate privacy requests and execute nuanced masking strategies.