# Google BigQuery
Google BigQuery is a fully managed, highly scalable, and serverless data warehouse and analytics platform that enables organizations to store, query, and analyze massive amounts of data with ease.
## Prerequisites
To integrate with Google BigQuery, you'll need to download or copy a Service Account Key file.

To find this file:

- Log into the Google Cloud Console.
- Navigate to **IAM & Admin** → **Service Accounts** and locate an account with appropriate permissions (`bigquery.datasets.get` & `bigquery.jobs.create`).
- In the **Actions** column for the account, click the three dots icon (⋮) and select **Manage keys**.
- Create a new key (JSON) for use with Fides; the key file will be downloaded to your computer.
- Open the file and copy the contents to paste into the integration form.
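The downloaded key file is a standard Google Cloud service account key: a JSON document containing fields such as `type`, `project_id`, `client_email`, and `private_key`. Paste the file's contents into the integration form verbatim rather than copying individual fields.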
## Integrating with Google BigQuery
To integrate an existing system with Google BigQuery:

- Navigate to **Data map → View Systems** and choose the system that you want to connect to.
- Click on the **Integrations** tab.
- Pick the vendor or system type that you want to integrate with from the **Connection type** drop-down menu.
- Complete the required fields for the integration and click **Save**.

You may confirm your connection is working using the **Test connection** button, if desired.

To learn more, please see our guide for Managing Integrations.
## Configuring a Dataset
A dataset is required to configure an integration to this database. A dataset is a configuration file that tells Fides how to query and update data in the database. To learn more about how to create and manage datasets, please see our Managing datasets guide.

An example BigQuery dataset can be viewed on GitHub.
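For orientation, the minimal sketch below shows the general shape of a dataset file, assuming a hypothetical `customer` table keyed by `id` with an `email` column; the `fides_key`, table, and field names are illustrative only:

```yaml
dataset:
  - fides_key: bigquery_minimal_example      # illustrative key; use your own
    name: Minimal BigQuery Dataset
    description: Sketch of a dataset describing a single hypothetical customer table.
    collections:
      - name: customer                       # maps to a table in your BigQuery dataset
        fields:
          - name: id
            data_categories: [user.unique_id]
            fides_meta:
              primary_key: True              # required for erasure requests
          - name: email
            data_categories: [user.contact.email]
            fides_meta:
              identity: email                # lets Fides locate rows by the request's email identity
              data_type: string
```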
### Dataset Namespace
BigQuery datasets support namespaces. This means that the dataset can contain the project ID and dataset ID that will be used when accessing or masking data from a given BigQuery table. If namespace information is not included, the dataset will use the default project ID and dataset ID from the associated integration.
```yaml
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    fides_meta:
      namespace:
        project_id: prj-bigquery-123456
        dataset_id: test-dataset-1
    ...
```
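With the namespace above, Fides addresses tables in this dataset by their fully qualified BigQuery name (e.g., `prj-bigquery-123456.test-dataset-1.customer` for a hypothetical `customer` collection) instead of falling back to the integration's default project and dataset.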
### Masking Strategy Overrides
Typically, erasure requests for database integrations issue "update" statements instead of "delete" statements, based on your policy-level masking strategies. However, the Fides BigQuery integration optionally supports issuing "delete" statements via collection-level masking strategy overrides.
To delete matching records in a collection as part of an erasure request, add a `masking_strategy_override` with the `delete` strategy to the collection-level `fides_meta`. This override takes priority over any policy-level masking strategies. As with typical erasure requests for database integrations, a primary key must be specified for the collection.
```yaml
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    collections:
      - name: employee
        fides_meta:
          masking_strategy_override:
            strategy: delete
        fields:
          - name: id
            data_categories: [user.unique_id]
            fides_meta:
              primary_key: True
    ...
```
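With this override in place, an erasure request that matches rows in the `employee` collection will issue a `DELETE` statement keyed on the `id` primary key, rather than the `UPDATE` statement a policy-level masking strategy would normally produce.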
### Erase After
To specify that a certain collection should be erased after others, use the `erase_after` attribute under collection-level `fides_meta`:
```yaml
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    collections:
      - name: address
        fides_meta:
          erase_after: [ bigquery_example_test_dataset.employee ]
```
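In this example, the `address` collection is only erased after the `employee` collection has been processed.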
`erase_after` is generally useful for avoiding foreign key constraint violations or for ensuring that dependent records are erased first.
### Partitioning Strategy
The Fides BigQuery integration supports partitioning strategies for datasets, which may be required by your BigQuery dataset configuration. To specify a partitioning strategy, add the `partitioning` attribute under collection-level `fides_meta`:
```yaml
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    collections:
      - name: employee
        fides_meta:
          partitioning:
            where_clauses: [
              "`created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 1000 DAY) AND `created` <= CURRENT_DATETIME()",
              "`created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 2000 DAY) AND `created` <= DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 1000 DAY)"
            ]
        fields:
          - name: email
            data_categories: [user.contact.email]
            fides_meta:
              identity: email
              data_type: string
          - name: created
            data_categories: [system.operations]
    ...
```
If a valid `partitioning` attribute is configured on a collection, DSR execution will run one query per specified `where_clauses` string, per node, per DSR.
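For the example above, this means Fides runs two queries against the `employee` collection for a given request, one scoped by each of the two `where_clauses` strings, so that both partition windows are covered.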
#### Validation for `partitioning` and `where_clauses`
To minimize the risk of malicious SQL injection attacks, Fides enforces some fairly restrictive validations on the `partitioning` attribute. We enforce that it has the required and expected `where_clauses` key, whose value must be a list of strings.

The string values are validated to ensure they match the expected syntax, which is strictly prescribed. Each string value MUST be a valid SQL clause that defines a partition window, of the form:

```
`column_1` >(=) [some value] AND `column_1` <(=) [some value]
```
To be clear, some notable constraints on the input:

- The clause string must begin by referencing a column name wrapped in backticks (`` ` ``).
- The clause string must compare that first column using a `<`, `>`, `<=`, or `>=` operator, and may include at most one other conditional using one of those operators, joined to the first conditional by an `AND` operator.
- If the clause string contains a second conditional, it must reference the same column name as the first conditional, also wrapped in backticks.
- Column names (wrapped in backticks) must always be on the left side of the comparison operator in their conditional.
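For example, a clause like `` `created` > '2020-01-01' AND `created` <= '2021-01-01' `` satisfies these constraints, while `created > '2020-01-01'` (column not wrapped in backticks) or `` '2020-01-01' < `created` `` (column on the right side of the operator) would be rejected.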