Google BigQuery
Google BigQuery is a fully managed, highly scalable, and serverless data warehouse and analytics platform that enables organizations to store, query, and analyze massive amounts of data with ease.
Prerequisites
To integrate with Google BigQuery, you'll need to download or copy a Service Account Key file.
To find this file:
- Log into the Google Cloud Console.
- Navigate to IAM & Admin → Service Accounts and locate an account with the appropriate permissions (bigquery.datasets.get & bigquery.jobs.create).
- In the Actions column for the account, click the three dots icon (⋮) and select Manage keys.
- Create a new key (JSON) for use with Fides; the key file will be downloaded to your computer.
- Open the file and copy the contents to paste into the integration form (see the sketch below).
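For reference, the downloaded Service Account Key is a JSON file shaped roughly like the following sketch. All values here are placeholders rather than real credentials, and the project ID simply reuses the namespace example from later in this guide:

{
  "type": "service_account",
  "project_id": "prj-bigquery-123456",
  "private_key_id": "<PRIVATE_KEY_ID>",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "fides-integration@prj-bigquery-123456.iam.gserviceaccount.com",
  "client_id": "<CLIENT_ID>",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token"
}

Paste the entire contents of the file, including the surrounding braces, into the integration form.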
Integrating with Google BigQuery
To integrate an existing system with Google BigQuery:
- Navigate to Data map → View Systems and choose the system that you want to connect to.
- Click on the Integrations tab.
- Pick the vendor or system type that you want to integrate with from the Connection type drop-down menu.
- Complete the required fields for the integration and click Save.
You can confirm the connection is working by clicking the Test connection button, if desired.
To learn more, please see our guide for Managing Integrations.
Configuring a Dataset
A dataset is required to configure an integration to this database. A dataset is a configuration file that tells Fides how to query and update data in the database. To learn more about how to create and manage datasets, please see our Managing datasets guide.
An example BigQuery dataset can be viewed on GitHub.
Dataset Namespace
BigQuery datasets support namespaces. This means the dataset can specify the project ID and dataset ID to use when accessing or masking data in a given BigQuery table. If namespace information is not included, the default project ID and dataset ID from the associated integration are used.
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    fides_meta:
      namespace:
        project_id: prj-bigquery-123456
        dataset_id: test-dataset-1
    ...
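With the namespace above, Fides qualifies generated queries with the configured project and dataset IDs. As a hedged illustration (the customer table and its columns are borrowed from the nested identity example below, and the exact SQL Fides generates may differ), an access query would reference the fully qualified table name along these lines:

SELECT id, contact_info FROM `prj-bigquery-123456.test-dataset-1.customer` WHERE (contact_info.email = <IDENTITY_EMAIL>)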
Nested Identity Support
BigQuery and MongoDB are the only integration types that support nested identities within a dataset. This means that the identity field can be included inside an object field (STRUCT in BigQuery), and access and erasure requests for this collection will correctly use the nested identity. This is supported for both default and custom identities.
Here's an example of how to define a nested identity field in a BigQuery dataset:
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    collections:
      - name: customer
        fields:
          - name: id
          - name: contact_info
            fides_meta:
              data_type: object
            fields:
              - name: email
                data_categories: [user.contact.email]
                fides_meta:
                  identity: email
                  data_type: string
In this example, email is a nested field within the contact_info object field and is marked as an identity field. Fides will generate the following select statement:
SELECT id, contact_info FROM `customer` WHERE (contact_info.email = <IDENTITY_EMAIL>)
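Custom identities use the same structure. As a hedged sketch (the loyalty_id column and the loyalty_id identity name are hypothetical, invented purely for illustration), a nested custom identity field could be declared like this:

collections:
  - name: customer
    fields:
      - name: contact_info
        fides_meta:
          data_type: object
        fields:
          - name: loyalty_id
            data_categories: [user.unique_id]
            fides_meta:
              identity: loyalty_id  # hypothetical custom identity name
              data_type: string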
Masking Strategy Override: Delete
Typically, erasure requests for database integrations issue "update" statements instead of "delete" statements, based on your policy-level masking strategies. However, the Fides BigQuery integration optionally supports issuing "delete" statements to delete entire rows using a collection-level masking strategy override.
When configured, the delete strategy will delete matching rows from a specific table (i.e., "collection") when processing an erasure request. To configure this, modify the fides_meta annotation for the collection whose rows should be deleted and add a masking_strategy_override object with a strategy: delete value, as in this example:
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    collections:
      - name: employee
        fides_meta:
          masking_strategy_override:
            strategy: delete
        fields:
          - name: id
            data_categories: [user.unique_id]
            fides_meta:
              primary_key: True
    ...
In the example above, erasure requests for the employee table will prioritize the strategy: delete override over any policy-level masking strategies, causing entire rows to be deleted instead of selectively masked. As with typical erasure requests for database integrations, at least one primary key or identity field must be specified for the collection.
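For illustration only (the exact statement Fides generates may differ), an erasure request under this override would issue a delete along these lines instead of an update:

DELETE FROM `employee` WHERE (id = <USER_ID>)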
Erase After
To specify that a certain collection should be erased after others, use the erase_after attribute under collection-level fides_meta:
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    collections:
      - name: address
        fides_meta:
          erase_after: [ bigquery_example_test_dataset.employee ]
erase_after is generally useful for avoiding foreign key constraint violations or for ensuring that dependent records are erased first. In the example above, rows in the address collection will only be erased after the employee collection has been processed.
Partitioning Strategy
The Fides BigQuery integration supports partitioning strategies for datasets, which may be required by your BigQuery dataset configuration. To specify a partitioning strategy, add the partitioning attribute under collection-level fides_meta:
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    collections:
      - name: employee
        fides_meta:
          partitioning:
            where_clauses: [
              "`created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 1000 DAY) AND `created` <= CURRENT_DATETIME()",
              "`created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 2000 DAY) AND `created` <= DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 1000 DAY)"
            ]
        fields:
          - name: email
            data_categories: [user.contact.email]
            fides_meta:
              identity: email
              data_type: string
          - name: created
            data_categories: [system.operations]
    ...
If a valid partitioning attribute is configured on a collection, DSR execution will run one query per specified where_clauses string, per node, per DSR request.
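As a hedged illustration (the exact SQL Fides generates may differ), the two where_clauses in the example above would cause an access request against the employee collection to run two queries, one per partition window:

SELECT email, created FROM `employee` WHERE (email = <IDENTITY_EMAIL>) AND `created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 1000 DAY) AND `created` <= CURRENT_DATETIME()

SELECT email, created FROM `employee` WHERE (email = <IDENTITY_EMAIL>) AND `created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 2000 DAY) AND `created` <= DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 1000 DAY)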
Validation for partitioning and where_clauses
To minimize the risk of malicious SQL injection attacks, Fides enforces fairly restrictive validations on the partitioning attribute: it must have the required and expected where_clauses key, whose value must be a list of strings.
The string values are validated against a strictly prescribed syntax: each MUST be a valid SQL clause that defines a partition window, of the form:
`column_1` >(=) [some value] AND `column_1` <(=) [some value]
To be clear, some notable constraints on the input (see the examples after this list):
- The clause string must begin by referencing a column name wrapped in backticks (`).
- The clause string must compare that first column using a comparison operator (<, >, <=, or >=), and may include at most one other conditional using one of those operators, joined to the first conditional by an AND operator.
- If the clause string contains a second conditional, it must reference the same column name as the first conditional, also wrapped in backticks.
- Column names (wrapped in backticks) must always be on the left side of the comparison operator in their conditional.
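As a hedged illustration of these constraints, assuming the DATETIME column `created` from the earlier partitioning example, the first two clause strings below would pass validation and the last two would be rejected:

-- Valid: a single conditional, with the backtick-wrapped column on the left
`created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 30 DAY)

-- Valid: two conditionals on the same column, joined by AND
`created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 60 DAY) AND `created` <= DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 30 DAY)

-- Invalid: the column name is not wrapped in backticks
created > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 30 DAY)

-- Invalid: the second conditional references a different column (`updated` is hypothetical)
`created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 30 DAY) AND `updated` <= CURRENT_DATETIME()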