Ethyca Team

June 29, 2021

Data Erasure In Distributed Systems

Encoding Respect

My talk at this year’s Privacy Engineering Practice and Respect (PEPR) conference came on the heels of the Colorado House voting to pass the state’s comprehensive privacy legislation. This regulatory news sums up one of my talk’s main points for engineers: users are looking for respectful systems, where respect is built into the processes that handle their data. Building respectful systems is both the right thing to do and—increasingly—the approach demanded by regulations worldwide.

For folks who could not attend PEPR, or those who want to revisit my talk, I’ve written up a couple of highlights on modern data erasure in distributed systems.

Key Take-Away #1: Personal data is never just a matter of deleting a single row. Method and order of deletion matter.

A single company will hold personal data across dozens of data systems, and these data systems often refer to one another. Thus, simply deleting a row or individual field could wreak havoc downstream on operations that, for example, rely on a string in a particular field that no longer exists. Maintaining these relationships is referred to as preserving “referential integrity”. In most cases, erasure should not interrupt the existing data relationships held across tables. At the same time, erasure methods must be comprehensive, visiting every table in every data system at least once and effectively erasing all identifiable information. Failing to do so could yield a costly fine for non-compliance with the growing number of privacy regulations worldwide.

Performing erasure at scale requires a systematic traversal of all data systems that could hold users’ data. That said, the essential algorithm is fairly straightforward:

Receive a user’s erasure request and look up their identity in your system
Collect initial row IDs that correspond to the user
For each row, collect all related row IDs, and repeat until all rows are identified
Erase personal data in all of the collected rows
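
The collection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RELATED` mapping is a hypothetical in-memory stand-in for whatever foreign keys or cross-system references link your rows, and `collect_rows` is a name invented for this example. The point is that each row is visited once, even if several rows refer to it.

```python
from collections import deque

# Hypothetical stand-in for related tables: each key is a (table, row_id)
# pair, and each value lists the rows that refer back to it.
RELATED = {
    ("users", 1): [("orders", 10), ("orders", 11)],
    ("orders", 10): [("shipments", 100)],
    ("orders", 11): [],
    ("shipments", 100): [],
}

def collect_rows(initial_rows, related=RELATED):
    """Return every row reachable from the initial rows, each visited once."""
    seen = set(initial_rows)
    queue = deque(initial_rows)
    ordered = []
    while queue:
        row = queue.popleft()
        ordered.append(row)
        # Follow each relationship, skipping rows that were already collected.
        for child in related.get(row, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return ordered

# Start from the rows that directly match the requesting user.
rows = collect_rows([("users", 1)])
```

In a real deployment the lookup inside the loop would be a query against each data system rather than a dictionary read, but the shape of the traversal stays the same.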

Without getting too into the weeds on data structures, this approach effectively creates a tree. Starting at the base (or root), a set of row IDs is collected. Then, for each of those row IDs, related row IDs branch off repeatedly until no further relationships exist. These endpoints are the “leaves” of the tree, and the safest order of deletion for this kind of structure is to start at the leaves and work backwards to the root. With this reverse-order traversal, a row is only erased once nothing else refers to it, so the erasure process can safely be interrupted without leaving the system in a bad state. In fact, your database constraints may force this!
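
To see how constraints can force the leaf-first order, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for an application database. The `users` and `orders` tables and the `try_delete` helper are assumptions made up for this example. Deleting the root row first is rejected by the foreign key, while deleting the leaf and then the root succeeds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless it is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        address TEXT
    );
    INSERT INTO users VALUES (1, 'a@example.com');
    INSERT INTO orders VALUES (10, 1, '1 Main St');
""")

def try_delete(sql):
    """Run a DELETE and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# Root first: rejected, because an order still refers to the user.
root_first = try_delete("DELETE FROM users WHERE id = 1")
# Leaf first, then root: both statements are accepted.
leaf = try_delete("DELETE FROM orders WHERE user_id = 1")
root_after = try_delete("DELETE FROM users WHERE id = 1")
conn.commit()

remaining = conn.execute(
    "SELECT (SELECT COUNT(*) FROM users) + (SELECT COUNT(*) FROM orders)"
).fetchone()[0]
```

If the process stopped after the leaf delete, the database would still be consistent: the user row would remain, with nothing dangling.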

Key Take-Away #2: Avoid storing personal data in data warehouses whenever possible.

Structural differences between application databases and data warehouses make the latter less conducive to this iterative approach. Data warehouses often contain many copies of personal data, because their tables are often denormalized. Furthermore, warehouse tables are usually indexed completely differently than application databases, which means that typically fast operations (like querying by user ID) may become painfully slow. As a result, there are typically more tables to visit, less efficiently, so erasure in a data warehouse is generally slower than it is in an application database.

Nevertheless, erasure is possible in a data warehouse. When implementing the same general algorithm as above, visiting thousands of these rows iteratively can be inefficient. To improve this, I’d recommend looking for where you can run bulk updates of records, taking advantage of the partitions or indexes that do exist in your data warehouse to carry out the operation at scale.
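
As a rough illustration of the bulk approach, the sketch below batches several pending erasure requests into a single `UPDATE` instead of visiting rows one at a time. It uses Python's built-in `sqlite3` module purely as a stand-in for a warehouse; the `events` table, its columns, and the `bulk_erase` helper are hypothetical, and a real warehouse would have its own bulk and partitioning features to lean on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_date TEXT, user_id INTEGER, email TEXT, page TEXT);
    INSERT INTO events VALUES
        ('2021-06-01', 1, 'a@example.com', '/home'),
        ('2021-06-01', 2, 'b@example.com', '/home'),
        ('2021-06-02', 1, 'a@example.com', '/cart'),
        ('2021-06-02', 3, 'c@example.com', '/cart');
""")

def bulk_erase(conn, user_ids):
    """Null out the email column for a whole batch of users in one statement."""
    user_ids = list(user_ids)
    if not user_ids:
        return 0
    # One placeholder per ID keeps the values parameterized.
    placeholders = ",".join("?" for _ in user_ids)
    cur = conn.execute(
        f"UPDATE events SET email = NULL WHERE user_id IN ({placeholders})",
        user_ids,
    )
    conn.commit()
    return cur.rowcount

# Two erasure requests handled with a single pass over the table.
erased = bulk_erase(conn, [1, 2])
remaining_emails = [
    row[0]
    for row in conn.execute("SELECT email FROM events WHERE email IS NOT NULL")
]
```

Batching trades a little latency per request for far fewer scans, which tends to matter most when the table is not indexed by user ID.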

Key Take-Away #3: Take a proactive privacy approach to reduce the overall complexity of data erasure.

As a company, you’re only responsible for erasing data that you actually store. It sounds almost vacuously true, but it’s increasingly important to emphasize data minimization—storing only the necessary amount of data, for the minimum amount of time—as a design mindset in building data systems.

As the demand for data-driven applications has exploded, developers have largely defaulted to collecting as much personal data as possible, since it’s convenient to do so. This mindset has simply been the industry norm. Meanwhile, proactive privacy steps like maintaining detailed metadata on categories of personal data aren’t typically a part of the development lifecycle. Instead, these tasks happen infrequently after deployment, taking up a lot of time.

I believe we can and should build privacy into the design of data systems. Respectful systems are systems that users deserve, and they simplify regulatory compliance so you can focus on engineering instead of ticking checkboxes. Take data minimization as an example. Developers can practice respectful design by minimizing the data they collect from the outset, and furthermore by proactively erasing data after it has fulfilled its business purposes. For example: expire all collected session data every few hours. In this case, erasure takes care of itself!
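
A session expiry job along these lines can be very small. The sketch below is one possible shape, not a prescribed implementation: the four-hour `SESSION_TTL`, the in-memory `sessions` dictionary, and the `purge_expired` function are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

SESSION_TTL = timedelta(hours=4)  # assumed retention window for this example

def purge_expired(sessions, now):
    """Drop sessions idle for longer than the TTL; return how many were removed."""
    expired = [
        sid for sid, data in sessions.items()
        if now - data["last_seen"] > SESSION_TTL
    ]
    for sid in expired:
        del sessions[sid]
    return len(expired)

now = datetime(2021, 6, 29, 12, 0, tzinfo=timezone.utc)
sessions = {
    "s1": {"user_id": 1, "last_seen": now - timedelta(hours=5)},
    "s2": {"user_id": 2, "last_seen": now - timedelta(minutes=30)},
}
removed = purge_expired(sessions, now)
```

Run on a schedule, a job like this means session data never needs to be tracked down when an erasure request arrives, because it is already gone.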

Virtually every company will need to implement privacy measures. It’s only a question of whether you’ll implement them proactively and efficiently, or retroactively and weigh down your team’s ability to innovate in the process.

Conclusion

On a conceptual level, data deletion is one of the most straightforward ideas in modern data privacy. Putting it into action across distributed systems, however, is more of a challenge. Modern data rights are here to stay (see: Colorado on the brink of passing the latest comprehensive privacy law in the US), and frankly we as developers can make respectful design choices to make it less difficult!

In my opinion, waiting for regulation to catch up to reality is pretty unnecessary; building more respectful systems makes better products, because they help you earn users’ trust. As I was preparing for PEPR, I came across a tweet from Dr. Ann Cavoukian, the scholar who put the Privacy By Design framework on the map. In a conversation about privacy vs. innovation, Dr. Cavoukian replied:

Get rid of the “vs.” You can have privacy AND innovation! You just need to proactively embed privacy-protective measures into the design of your operations — bake it into the code, by design!

Couldn’t have said it better myself.
