Annotate Complex Fields
Fides can retrieve and mask data from complex objects and arrays in MongoDB. This involves annotating your dataset files to let Fides know about your complex data.
Declare an object field
To declare an object
field, define nested fields underneath that field. In the example below, workplace_info
is an object field with two nested fields: employer
and position
.
Data categories cannot be specified at the object
level due to potential conflicts with nested fields. Instead, annotate the scalar fields within the object field. Here, the workplace_info.position
field has data_category: user.job_title
.
dataset:
- fides_key: mongo_nested_object_example
name: Mongo Example with Nested Objects
description: Example of a Mongo dataset that contains 'details' about customers defined in the 'postgres_example_test_dataset'
collections:
- name: customer_details
fields:
- ...
- name: workplace_info
fidesops_meta:
data_type: object
fields:
- name: employer
fidesops_meta:
data_type: string
- name: position
data_categories: [ user.job_title ]
fidesops_meta:
data_type: string
- name: id
Reference a nested field
To define a relationship between a field on one collection and a nested field on another collection, use dot notation in the fidesops_meta
references for as many levels are necessary.
In the example below, this field is denoted by <collection_name>.<field_name>.<sub_field>
name, or
customer_details.workplace_info.id
.
This relationship could also be defined on the customer_details.workplace_info.id
field itself,
with a direction of from
, with field mydatabase.customer.workplace_id
, and dataset mydatabase
.
dataset:
- fides_key: mydatabase
name: internal database
description: our internal database of customer data
collections:
- name: customer
fields:
- name: workplace_id
data_categories: [system.operations]
fidesops_meta:
references:
- dataset: mongo_nested_object_example
field: customer_details.workplace_info.id
direction: to
...
Declare an array field
There is no official array
type. Instead, an array is represented by a []
flag on a field.
Declare an array of scalar values
In this example, the mydatabase:customer
collection has a travel_identifiers
field that is an string array,
described by data_type: string[]
. An array of integers would be described by data_type: integer[]
.
dataset:
- fides_key: mydatabase
name: internal database
description: our internal database of customer data
collections:
- name: customer
fields:
- ...
- name: travel_identifiers
fidesops_meta:
data_type: string[]
data_categories: [system.operations]
Declare a nested array
In this example, the mydatabase:customer
collection has a nested workplace_info.direct_reports
string array.
We define direct_reports
as a subfield under workplace_info
, as well as add the data_type string[]
to direct_reports
.
dataset:
- fides_key: mydatabase
name: internal database
description: our internal database of customer data
collections:
- name: customer
fields:
- name: workplace_info
fidesops_meta:
data_type: object
fields:
- name: employer
fidesops_meta:
data_type: string
- name: position
data_categories: [ user.job_title ]
fidesops_meta:
data_type: string
- name: direct_reports
data_categories: [ user.name ]
fidesops_meta:
data_type: string[]
Declare an array of objects
In this example, the mydatabase:customer
collection has an emergency_contacts
object array field, or
embedded documents, denoted by data_type: object[]
. Each object in the emergency_contacts
array can contain a
name
, relationship
, and phone
field.
dataset:
- fides_key: mydatabase
name: internal database
description: our internal database of customer data
collections:
- name: customer
fields:
- name: emergency_contacts
fidesops_meta:
data_type: object[]
fields:
- name: name
data_categories: [ user.name ]
fidesops_meta:
data_type: string
- name: relationship
fidesops_meta:
data_type: string
- name: phone
data_categories: [ user.contact.phone_number ]
fidesops_meta:
data_type: string
Reference an array
Reference an array field as if it is any other field. You cannot currently reference a specific index in an array field, but Fides will search an array field for matches.
In this example, mydatabase:flights.plane
is an integer field that will be used to lookup records that match an integer
in the mydatabase:aircraft.planes
array.
dataset:
- fides_key: mydatabase
name: internal database
description: our internal database of customer data
collections:
- name: flights
fields:
- ...
- name: passenger_information
fields:
- name: passenger_ids
fidesops_meta:
data_type: string[]
- name: plane
data_categories: [ system.operations ]
fidesops_meta:
data_type: integer
- name: aircraft
fields:
- name: _id
data_categories: [ system.operations ]
fidesops_meta:
primary_key: True
data_type: object_id
- name: planes
data_categories: [ system.operations ]
fidesops_meta:
data_type: integer[]
references:
- dataset: mydatabase
field: flights.plane
direction: from
- name: model
data_categories: [ system.operations ]
fidesops_meta:
data_type: string
In this more complicated example, a field in an array of objects is used to look up a different field in an array of
objects in another collection. Potentially multiple values from mydatabase:customer.comments.comment_id
can be used
to query for corresponding values in mydatabase:conversations.thread.comment
. Because this field is in an array of objects,
multiple matches may be found.
dataset:
- fides_key: mydatabase
name: internal database
description: our internal database of customer data
collections:
- name: customer
fields:
- name: comments
fidesops_meta:
data_type: object[]
fields:
- name: comment_id
fidesops_meta:
data_type: string
references:
- dataset: mydatabase
field: conversations.thread.comment
direction: to
- name: conversations
fidesops_meta:
data_type: object[]
fields:
- name: thread
fields:
- name: comment
fidesops_meta:
data_type: string
- name: message
fidesops_meta:
data_type: string
- name: chat_name
data_categories: [ user.name ]
fidesops_meta:
data_type: string
Query an array
There are some assumptions made with array querying that may or may not fit with how your data is structured. If an array is an entrypoint into a collection (i.e., one collection references its array field), there is ambiguity around how the queries should be built. (e.g., AND versus OR, and whether only the matched indices or matched embedded documents within arrays should be considered).
Assumptions
- If an array is the entry point into a node, Fides will search for corresponding matches across the entire array. You cannot specify a certain index.
- Fides operates on "OR" queries. Data returned from multiple array fields will be flattened before being passed into the next collection.
- For example, Collection A returned values [1, 2, 3], and Collection B returned values [4, 5, 6]. Collection C has an array field that depends on both Collection A and Collection B. Fides will search Collection C's array field to return any record that contains one of the values [1, 2, 3, 4, 5, 6] in the array.
- By default, if an array field is an entry point to a node, only matching indices in that array are considered, both for access and erasures, as well as for subsequent queries on dependent collections where applicable.
- For example, a query on Collection A only matched indices 0 and 1 in an array. Only the data located at indices 0 and 1 will be returned, and used to query data on dependent collection C.
- This can be overridden by specifying
return_all_elements: true
on an entrypoint array field, in which case, the query will return the entire array and/or mask the entire array.
- Individual array elements are masked, not the entire array, e.g. ["MASKED", "MASKED", "MASKED"]
Example query traversal
This is an example traversal created from the test postgres_example
and mongo_test
datasets. Multiple collections point to or from complex objects and arrays. See the mongo_example_test_dataset.yml
for more information.