For privacy engineers to build privacy directly into the codebase, they need agreed-upon definitions for translating policy into code. Ethyca CEO Cillian unveils an open source system to standardize definitions for personal data living in the tech stack.
Back in July, I articulated some of Ethyca’s driving ideas in a Stack Overflow feature: Privacy is an afterthought in the software lifecycle. That needs to change.
To solve the problem I described, I believe the dev community needs to agree upon and build an open source definition language and set of tools for privacy.
The purpose of these tools is simple:
The two key benefits yielded are (1) ensuring that software systems more easily comply with complex data privacy regulations that are already forcing change on the tech community, and (2) ensuring that the products we build more naturally respect the rights of users.
Over the last three years at Ethyca, we’ve been working hard with technical design partners and engineering teams at some of the world’s biggest tech companies to understand the root cause of privacy challenges and build the tools necessary to solve these from the ground up. I’m excited that in the coming months we’ll finally share the culmination of that work with the first public release of our open source privacy tools.
The foundation of those tools is a consistent understanding of types and uses of data, and so I want to share the first public release of our data taxonomy with you today.
Its purpose is to create an agreed-upon definition of types and categories of personal data. This is a vital first step in any ontological design as, in order for any privacy standard to be interoperable, we must first achieve a shared definition of what types of data there are.
I’m eager to share this and excited to get any feedback that might improve this standard for every developer. I believe what follows forms the foundation of any realistic solution to privacy engineering.
The rest of this post breaks down our current thinking on what will be a freely available and extensible taxonomy: its components, hierarchy and syntax.
As stated briefly above and in my other posts, if the dev community is going to solve privacy, we need to agree on a standard definition language on which to base our understanding of systems, the risk they pose and our ability to codify healthy policies into them — ultimately, an ontology for privacy (more on that in a future post).
The starting point for that is a definition of entities in a system: a taxonomy, the ability to describe what types of data we are handling and what we’re using them for. If we can make it easy for any dev to do this as part of their implementation process, we can start to bake privacy naturally into healthy software design and delivery processes.
So a lot rides on getting the dev community to standardize their way of describing privacy data and privacy-related data processes in their systems. The taxonomy that follows is our attempt to publicly start that process, and we want to encourage as many developers as possible to contribute their views so that we can collectively build better technology.
This post marks the day we’re starting to release years of development work at Ethyca, as we had always intended, for the open source community. With that in mind, feel free to grab the taxonomy repo available below or use the codepen to explore the structure visually.
or codepen here:
Our goal at Ethyca was to conduct the detailed research necessary to provide the dev community with a first draft taxonomy robust enough to capture a comprehensive view of privacy, yet intuitive enough for any engineer to easily apply.
To achieve this, we evaluated existing privacy ontologies and their taxonomies, such as PrOnto and COPri v.2; international standards like ISO 19944; and contrasted with major data privacy regulations like the GDPR (the ICO’s website is a helpful read for those curious), CCPA, LGPD and drafts of the Indian PDPB.
Armed with this analysis and feedback from our technical design partners, we’ve been refining this taxonomy for over a year. We feel that it’s an early but confident first step in capturing everything you will need to describe the privacy behaviors and data types of your tech stack.
To achieve this, we made some early, opinionated and intentional decisions that I’d love feedback on. So the taxonomy repo you’ll see here:
Conceptually, the taxonomy is segmented in four groupings as follows:
Data Categories are a comprehensive hierarchy of labels to represent types of data in your systems. They can be coarse definitions such as “User Provided Data” or fine grained ones, such as “User Provided Email Address”. We’ll dig into this in a bit more detail shortly.
Data Use Categories are labels that describe how, or for what purposes, you are using the data. This branch of the taxonomy creates a structure for the most common uses of data in software applications. An example might be the use of data for payment processing or first party personalized advertising. These would both be data uses, ways in which your system uses data.
“Subjects” is a slightly esoteric term, common in the privacy industry to represent the user type — that is to say, the label applied to describe the provider or owner of the data. E.g., if you have an email address in your system, it might belong to an employee of the company, or to a customer. In this case, employees and customers are both “subjects” of the system. Under various privacy regulations, they are afforded rights over managing how their data is used and that may vary by the subject type, so the distinction between a patient in a medical records system and a customer in an e-commerce system is important.
Data Identification Qualifiers provide added context as to the degree of identification and therefore, potential risk using the data might pose relative to identifying an individual. A simple way to think about this is a spectrum: on one end is completely anonymous data, i.e. it is impossible to identify an individual from it, and on the other end is data that specifically identifies an individual. Along this spectrum are various labels that denote the degree of identification that a given data might provide.
We worked through various potential syntaxes to ensure it’s as easy as possible to read and write in plain English. Ultimately, dot notation lends itself most appropriately to writing statements that are concise and easy to understand. So the branch structure is dot notation and you’ll see some of the end nodes are complex terms that use snake case for clarity.
As such, if you were attempting to use the taxonomy of data categories to label, for example, an email address, the resulting hierarchical notation would look like:
As you can see, it’s pretty easy to deduce from this that it’s user-provided data, it identifies them and it is considered contact information — more specifically, an email address.
The design of these hierarchical structures is intended to allow any dev implementing this to type out the classification to a level of specificity that suits their needs. If I take the above example, my team and I might decide that we’re satisfied if we know it’s identifiable data about a user, which would look like:
Or labeling it with slightly more specificity as contact data:
As stated, the objective with releasing the taxonomy now is to discuss, debate and over time, iteratively improve the taxonomy so that it can cater to most common scenarios for devs and data teams in describing systems. Here we’ll delve into the current structure and our rationale for their present state.
As you can see from the dot notation example above and the taxonomy visualizer tool in the header, Data Categories are classified into primitive categories with a hierarchy of branches and nodes that allow for degrees of precision when classifying data.
Here’s a breakdown of that structure:
There are three top-level categories:
In defining these, we looked at the cross-labeling implications of data types such as an email address, which may be both account data and user data. This is a logical multi-label assignment so that you can manage this data for both purposes, or perhaps create exclusionary rules related to its use.
TLDR: When considering and testing labeling extensively, we found that with three clear primitives you could elegantly construct a series of labels that covered the broadest possible data types while limiting the number of terms needed to do so.
For each top-level node, there are multiple branches that provide richer context. You will see for the first two, account and system, these are limited, where the user node provides subclasses suitable to assist in detailed personal data management.
Most of these are likely self-evident, Of note here are the derived and provided labels, as these respectively describe where data was derived by the system through observation or inference, versus explicitly provided by a user.
As you can see, the hierarchy supports simple labels or, where necessary, very precise and fine-grained annotations. It’s easiest from here for you to dive in and play with the classifications yourself. However, we’ll quickly look at level three of the user branch specifically, where you have branches from derived and provided:
You can see this is split into identifiable and non-identifiable data:
And a similar split applies to provided, as shown below.
Similar to Data Categories, for Data Use, we’ve attempted to capture the widest variety of data use cases with the briefest hierarchy we can. In addition to this, we’ve captured all of the use cases described by ISO 19944 and GDPR to ensure that a single taxonomy can describe data uses across data privacy frameworks.
You’ll see if you use the taxonomy explorer in the header that this currently breaks down as follows:
At present there are seven top-level nodes to the use categories taxonomy branch. We think this still needs work and are continuing to optimize. However, today they are:
From here, it’s likely quicker to explore the second level data use categories yourself. However, it’s worth noting that we’ve attempted to capture the most common constructs that create privacy risks. For example:
These two examples show really important distinctions of use. The first is where your product is performing personalized marketing/advertising by receiving and processing data from a third party. Whereas the second example declares that your system is sharing data with a third party for their use in advertising — very different uses and privacy implications!
As a final word on data use categories, I stated at the top of this post that we’ve designed this with extensibility in mind, and data uses are a really effective example of this. Every business or software system is different and as such, you’re likely to have different or industry specific uses for your system.
The objective therefore with data uses is to create a simple framework to generate a clear hierarchy of classifications, so that you can quickly extend this for your use, whether it’s medical data use or some other sensitive process.
Finally, if you look at the repository history, you’ll see we’ve been iterating on structure from snake_case to dot notation and also hierarchy of terms. I’m hopeful that we’ll continue to do this constantly with feedback from devs implementing this to ensure it satisfies real-world use cases.
This is likely the easiest group of the taxonomy to understand. At present it’s a flat structure with no hierarchy and represents the various types of users (aka subjects) that may be participants in your system. These could be users, customers, employees, patients, voters, etc.
You might ask why we’ve done this. The benefit of this specificity is future-proofing. As privacy regulations evolve we expect that certain groups of users’ data will be managed differently. The ability to assign one or multiple user types to your data assures that in future you can build policies and enforcements around data for any business or legal requirement. So you might decide that today you treat employee and customer data the same way, but you will have the flexibility to change retention policies on employee data in the future. As with everything in the taxonomy, you can extend this to support specific business cases. This flexibility means that a thoughtful system need not be fragile, rendered unworkable as soon as compliance requirements evolve. To the contrary: the ontology enables the system to be nimble, a vital quality in a landscape as dynamic as modern data and privacy compliance needs.
Subject categories are explicit in their meaning today, so:
Similar to subject categories, identification qualifiers are a flat structure today and represent the common groups of data with respect to their ability to identify an individual. They are:
In this post I’ve proposed a first draft of a privacy taxonomy, one that underpins much of the thinking we do at Ethyca. What we’re releasing today is just an up-down taxonomy. However, this precedes an entire ontology that provides a simple grammar to describe complex data flows and privacy-related behaviors in a software system. This has been at the heart of our work for nearly three years now.
I’m excited to finally start sharing that work publicly with the community, and I encourage feedback, debate and changes. By having these conversations, we can build a better standard and the tools necessary to make this easy for every dev to implement.
Over the coming weeks we’ll be releasing more details of our work in this space and the benefits created by these tools. We welcome your feedback, participation and contribution.
If you’d like to chat about anything here you can get me on Twitter, @cillian.
This article is the first installment in a three-part series from CEO Cillian Kieran on privacy devtools and their underlying systems. Explore the rest of the trilogy:
October 13, 2021