
DataHub

This Integration requires Fides Cloud or Fides Enterprise. For more information, talk to our solutions team.

DataHub is an open-source metadata platform for the modern data stack. It enables data discovery, data observability, and federated governance for your data ecosystem.

Prerequisites

In order to integrate with DataHub, you'll need to collect the following information:

Name | Description
DataHub Server URL | The URL of your DataHub instance (e.g., http://localhost:8080)
Access Token | A DataHub access token with appropriate permissions. See DataHub Authentication for more details.
BigQuery Credentials | Credentials that will be used on both the Fides and DataHub sides
Glossary Node | The node name in DataHub's glossary where terms will be created

A Data Category in Fides is represented as a Glossary Term in DataHub. We use these two terms interchangeably.

Every Glossary Term managed by Fides is tied to a Glossary Term Group (named FidesDataCategories by default).

Only Datasets created from BigQuery are supported at this time. This is verified using the Dataset's fides_meta.dataset.connection_type field.
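
Before starting the setup, you can sanity-check the DataHub Server URL and Access Token from the table above. The sketch below assumes DataHub's standard /config endpoint is reachable on the GMS server and that token authentication is enabled; it is a quick smoke test, not part of the Fides setup.

```python
# Minimal smoke test for the DataHub prerequisites: hit the GMS /config endpoint
# with the access token. A 200 response means the URL and token are usable.
import requests

datahub_server = "http://localhost:8080"   # your DataHub Server URL
access_token = "<datahub-access-token>"    # your DataHub access token

resp = requests.get(
    f"{datahub_server}/config",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```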

How to set up

Setup on Fides Side

  1. Create a new BigQuery integration
  2. Create a monitor for this integration
  3. Scan using this monitor
  4. Create a Dataset from the result of the scan
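
For step 4, the result of the scan is a Fides Dataset. A rough illustration of what such a Dataset might contain is sketched below as a Python dict; the collection and field names are hypothetical, and the exact nesting of the connection type under fides_meta is inferred from the note above.

```python
# Hypothetical example of a Dataset created from a BigQuery monitor scan.
# Only the general shape matters here: the connection type marks it as a
# BigQuery Dataset, and the data_categories are what get mirrored into
# DataHub as Glossary Terms.
bigquery_dataset = {
    "fides_key": "bigquery_customer_data",           # hypothetical key
    "name": "BigQuery customer data",
    "fides_meta": {"connection_type": "bigquery"},   # only BigQuery Datasets are synced
    "collections": [
        {
            "name": "customers",
            "fields": [
                {"name": "email", "data_categories": ["user.contact.email"]},
                {"name": "created_at", "data_categories": ["system.operations"]},
            ],
        }
    ],
}
```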

Setup on DataHub Side

  1. Create a new BigQuery ingestion
  2. Run the Ingestion
  3. Verify that the DataHub datasets were created
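
A programmatic way to do step 3 is to list the BigQuery dataset URNs DataHub knows about. The sketch below assumes the acryl-datahub Python SDK is installed and recent enough to expose DataHubGraph.get_urns_by_filter; checking in the DataHub UI works just as well.

```python
# List BigQuery dataset URNs in DataHub to confirm the ingestion ran.
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(
    DatahubClientConfig(server="http://localhost:8080", token="<datahub-access-token>")
)

# Iterate over the dataset entities ingested from the bigquery platform.
for urn in graph.get_urns_by_filter(entity_types=["dataset"], platform="bigquery"):
    print(urn)
```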

Configure Integration

  1. Create a new DataHub integration using the DataHub server connection info:
    • DataHub Server URL
    • DataHub Token
    • Frequency (daily, weekly, monthly)
    • Glossary Node Name

Once this has been set up, Fides will sync the Data Categories according to the configured frequency.
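
For reference, the same connection info grouped as a Python dict; the key names below are hypothetical placeholders for illustration, not the actual Fides API schema.

```python
# Hypothetical grouping of the DataHub integration settings listed above.
datahub_integration = {
    "datahub_server_url": "http://localhost:8080",
    "datahub_token": "<datahub-access-token>",
    "frequency": "weekly",                    # daily, weekly, or monthly
    "glossary_node": "<glossary-node-name>",  # where Glossary Terms are created
}
```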

How to use

Fides continuously checks in the background for DataHub integrations and their Datasets that are due to sync. This process:

  • Runs once an hour
  • Looks for due DataHub integrations to sync (using the last_run_timestamp and frequency fields)
  • Processes related due Datasets to sync
  • Runs a newly created DataHub integration at the next on-the-hour check
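
The due-check described above can be pictured with a small sketch: given an integration's last_run_timestamp and frequency, decide whether this hourly pass should sync it. This is illustrative only, not the actual Fides implementation.

```python
# Illustrative version of the hourly due-check for DataHub integrations.
from datetime import datetime, timedelta, timezone
from typing import Optional

FREQUENCY_INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),  # simplification; calendar months vary
}

def is_due(last_run_timestamp: Optional[datetime], frequency: str) -> bool:
    """Return True if the integration should be synced on this hourly pass."""
    if last_run_timestamp is None:
        # A freshly created integration syncs at the next on-the-hour run.
        return True
    elapsed = datetime.now(timezone.utc) - last_run_timestamp
    return elapsed >= FREQUENCY_INTERVALS[frequency]

print(is_due(None, "weekly"))                                            # True
print(is_due(datetime.now(timezone.utc) - timedelta(days=8), "weekly"))  # True
print(is_due(datetime.now(timezone.utc) - timedelta(days=2), "weekly"))  # False
```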

Triggering a sync using the API

You can trigger a DataHub sync using the following API endpoint:

POST /api/v1/plus/connection/datahub/{connection_key}/sync

The request body accepts a list of dataset_ids to sync. If the list is empty, no sync is triggered. More details can be found in the Swagger UI.
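
A minimal example of calling this endpoint with requests is sketched below. The connection key and dataset IDs are placeholders, and the exact request body schema (shown here as a bare JSON list of dataset IDs) is an assumption; confirm it against the Swagger UI.

```python
# Trigger a DataHub sync for specific Datasets via the Fides API.
import requests

fides_url = "https://<your-fides-host>"
connection_key = "my_datahub_integration"   # hypothetical DataHub connection key

resp = requests.post(
    f"{fides_url}/api/v1/plus/connection/datahub/{connection_key}/sync",
    headers={"Authorization": "Bearer <fides-api-token>"},
    json=["bigquery_customer_data"],   # dataset_ids to sync; an empty list triggers nothing
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```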