# Google BigQuery
Google BigQuery is a fully managed, highly scalable, and serverless data warehouse and analytics platform that enables organizations to store, query, and analyze massive amounts of data with ease.
## Prerequisites
To integrate with Google BigQuery, you'll need to download or copy a Service Account Key file.

To find this file:

- Log into the Google Cloud Console.
- Navigate to **IAM & Admin** → **Service Accounts** and locate an account with appropriate permissions (`bigquery.datasets.get` & `bigquery.jobs.create`).
- In the **Actions** column for the account, click the three dots icon (⋮) and select **Manage keys**.
- Create a new key (JSON) for use with Fides; the key file will be downloaded to your computer.
- Open the file and copy the contents to paste into the integration form.
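The downloaded key file is a standard Google Cloud service account key: a JSON document containing fields such as `type`, `project_id`, `client_email`, and `private_key`. Paste the file's contents into the integration form verbatim rather than copying individual fields.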
## Integrating with Google BigQuery
To integrate an existing system with Google BigQuery:

- Navigate to **Data map → View Systems** and choose the system that you want to connect to.
- Click on the **Integrations** tab.
- Pick the vendor or system type that you want to integrate with from the **Connection type** drop-down menu.
- Complete the required fields for the integration and click **Save**.

You may confirm your connection is working using the **Test connection** button, if desired.

To learn more, please see our guide for Managing Integrations.
## Configuring a Dataset
A dataset is required to configure an integration to this database. A dataset is a configuration file that tells Fides how to query and update data in the database. To learn more about how to create and manage datasets, please see our Managing datasets guide.

An example BigQuery dataset can be viewed on GitHub.
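For orientation, the minimal sketch below shows the general shape of a dataset file, assuming a hypothetical `customer` table keyed by `id` with an `email` column; the `fides_key`, table, and field names are illustrative only:

```yaml
dataset:
  - fides_key: bigquery_minimal_example      # illustrative key; use your own
    name: Minimal BigQuery Dataset
    description: Sketch of a dataset describing a single hypothetical customer table.
    collections:
      - name: customer                       # maps to a table in your BigQuery dataset
        fields:
          - name: id
            data_categories: [user.unique_id]
            fides_meta:
              primary_key: True              # required for erasure requests
          - name: email
            data_categories: [user.contact.email]
            fides_meta:
              identity: email                # lets Fides locate rows by the request's email identity
              data_type: string
```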
### Dataset Namespace
BigQuery datasets support namespaces. This means that the dataset can contain the project ID and dataset ID that will be used when accessing or masking data from a given BigQuery table. If namespace information is not included, the dataset will use the default project ID and dataset ID from the associated integration.
```yaml
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    fides_meta:
      namespace:
        project_id: prj-bigquery-123456
        dataset_id: test-dataset-1
    ...
```
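With the namespace above, Fides addresses tables in this dataset by their fully qualified BigQuery name (e.g., `prj-bigquery-123456.test-dataset-1.customer` for a hypothetical `customer` collection) instead of falling back to the integration's default project and dataset.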
### Masking Strategy Overrides
Typically, erasure requests for database integrations issue "update" statements instead of "delete" statements, based on your policy-level masking strategies. However, the Fides BigQuery integration optionally supports issuing "delete" statements via collection-level masking strategy overrides.
To delete matching records in a collection as part of an erasure request, add a `masking_strategy_override` with the `delete` strategy to the collection-level `fides_meta`. This override takes priority over any policy-level masking strategies. As with typical erasure requests for database integrations, a primary key must be specified for the collection.
```yaml
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    collections:
      - name: employee
        fides_meta:
          masking_strategy_override:
            strategy: delete
        fields:
          - name: id
            data_categories: [user.unique_id]
            fides_meta:
              primary_key: True
    ...
```
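With this override in place, an erasure request that matches rows in the `employee` collection will issue a `DELETE` statement keyed on the `id` primary key, rather than the `UPDATE` statement a policy-level masking strategy would normally produce.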
### Erase After
To specify that a certain collection should be erased after others, use the `erase_after` attribute under collection-level `fides_meta`:
```yaml
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    collections:
      - name: address
        fides_meta:
          erase_after: [ bigquery_example_test_dataset.employee ]
```
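In this example, the `address` collection is only erased after the `employee` collection has been processed.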
`erase_after` is generally useful for avoiding foreign key constraint violations or for ensuring that dependent records are erased first.
### Partitioning Strategy
The Fides BigQuery integration supports partitioning strategies for datasets, which may be required by your BigQuery dataset configuration. To specify a partitioning strategy, add the `partitioning` attribute under collection-level `fides_meta`:
```yaml
dataset:
  - fides_key: bigquery_example_test_dataset
    name: BigQuery Example Test Dataset
    description: Example of a BigQuery dataset containing a variety of related tables like customers, products, addresses, etc.
    collections:
      - name: employee
        fides_meta:
          partitioning:
            where_clauses: [
              "`created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 1000 DAY) AND `created` <= CURRENT_DATETIME()",
              "`created` > DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 2000 DAY) AND `created` <= DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 1000 DAY)"
            ]
        fields:
          - name: email
            data_categories: [user.contact.email]
            fides_meta:
              identity: email
              data_type: string
          - name: created
            data_categories: [system.operations]
    ...
```
If a valid `partitioning` attribute is configured on a collection, DSR execution will run one query per specified `where_clauses` string, per node, per DSR.
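For the example above, this means Fides runs two queries against the `employee` collection for a given request, one scoped by each of the two `where_clauses` strings, so that both partition windows are covered.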
#### Validation for `partitioning` and `where_clauses`
To minimize the risk of malicious SQL injection attacks, Fides enforces some fairly restrictive validations on the `partitioning` attribute. We enforce that it has the required and expected `where_clauses` key, whose value must be a list of strings.

The string values are validated to ensure they match the expected syntax, which is strictly prescribed. Each string value MUST be a valid SQL clause that defines a partition window, of the form:

```
`column_1` >(=) [some value] AND `column_1` <(=) [some value]
```
To be clear, some notable constraints on the input:

- The clause string must begin by referencing a column name wrapped in backticks (`` ` ``).
- The clause string must compare that first column using a `<`, `>`, `<=`, or `>=` operator, and may include at most one other conditional using one of those operators, joined to the first conditional by an `AND` operator.
- If the clause string contains a second conditional, it must reference the same column name as the first conditional, also wrapped in backticks.
- Column names (wrapped in backticks) must always be on the left side of the comparison operator in their conditional.
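For example, a clause like `` `created` > '2020-01-01' AND `created` <= '2021-01-01' `` satisfies these constraints, while `created > '2020-01-01'` (column not wrapped in backticks) or `` '2020-01-01' < `created` `` (column on the right side of the operator) would be rejected.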