DataHub
DataHub is an open-source metadata platform for the modern data stack. It enables data discovery, data observability, and federated governance for your data ecosystem.
Prerequisites
In order to integrate with DataHub, you'll need to collect the following information:
Name | Description |
---|---|
DataHub Server URL | The URL of your DataHub instance (e.g., http://localhost:8080) |
Access Token | A DataHub access token with appropriate permissions. See DataHub Authentication for more details. |
BigQuery Credentials | Credentials used by both Fides and DataHub to connect to BigQuery |
Glossary Node | The node name in DataHub's glossary where terms will be created |
A Data Category in Fides is represented as a Glossary Term in DataHub. We use these two terms interchangeably.
Every Glossary Term managed by Fides is tied to a Glossary Term Group (named FidesDataCategories by default).
Only Datasets created from BigQuery are supported at this time. Fides verifies this using the Dataset's fides_meta.dataset.connection_type field.
How to set up
Setup on Fides Side
- Create a new BigQuery integration
- Create a monitor for this integration
- Scan using this monitor
- Create a Dataset from the result of the scan
Setup on DataHub Side
- Create a new BigQuery ingestion
- Run the Ingestion
- Verify that the DataHub datasets were created
Configure Integration
- Create a new DataHub integration using the DataHub server connection info:
- DataHub Server URL
- DataHub Token
- Frequency (daily, weekly, monthly)
- Glossary Node Name
Once this has been set up, Fides will sync the Data Categories according to the configured frequency.
How to use
Fides continuously checks in the background for DataHub integrations and Datasets that are due to sync. This process:
- Runs once an hour
- Looks for due DataHub integrations to sync (using the last_run_timestamp and frequency fields)
- Processes related due Datasets to sync
- Runs a newly created DataHub integration at the top of the next hour
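The due check in the list above can be illustrated with a minimal sketch. This is not the Fides implementation: the function name is_due, the interval lengths (including the 30-day approximation of a month), and the rule that an integration with no previous run is always due are assumptions made for illustration. Only the last_run_timestamp and frequency fields and the three frequency values come from this page.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed interval lengths; the page only names the three frequencies.
FREQUENCY_INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),  # assumption: a month approximated as 30 days
}


def is_due(last_run_timestamp: Optional[datetime], frequency: str, now: datetime) -> bool:
    """Return True if an integration should be synced at time `now`."""
    # Assumption: an integration that has never run is due immediately,
    # which matches a new integration being picked up by the next hourly check.
    if last_run_timestamp is None:
        return True
    return now - last_run_timestamp >= FREQUENCY_INTERVALS[frequency]


# Example: one hourly check evaluating a daily integration.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
print(is_due(now - timedelta(hours=23), "daily", now))
print(is_due(None, "weekly", now))
```

Because the check runs hourly, a sync under this sketch would start at the first hourly run after the interval has elapsed, not at the exact moment it elapses.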
Using the API for triggering a sync
You can trigger a DataHub sync using the following API endpoint:
POST /api/v1/plus/connection/datahub/{connection_key}/sync
The endpoint expects a request body containing a list of dataset_ids to sync. If the list is empty, no sync is triggered. See the Swagger UI for more details.
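As a rough sketch, the request could be built with Python's standard library as shown below. Several details here are assumptions, not confirmed by this page: the body is shown as a JSON object with a dataset_ids key, authentication is shown as a bearer token, and the host name, connection key, dataset IDs, and the helper name build_sync_request are hypothetical. Check the Swagger UI for the exact request schema before relying on this.

```python
import json
import urllib.parse
import urllib.request


def build_sync_request(base_url, connection_key, dataset_ids, token):
    """Build (but do not send) a POST request to the DataHub sync endpoint."""
    key = urllib.parse.quote(connection_key, safe="")
    url = f"{base_url.rstrip('/')}/api/v1/plus/connection/datahub/{key}/sync"
    # Assumption: the list is wrapped in an object under "dataset_ids".
    body = json.dumps({"dataset_ids": dataset_ids}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the Fides API accepts a bearer token here.
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical values for illustration only.
request = build_sync_request(
    "https://fides.example.com",
    "my_datahub_connection",
    ["my_bigquery_dataset"],
    "example-token",
)
print(request.get_method(), request.full_url)

# To send it against a real Fides server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Since an empty list triggers nothing, a caller may want to skip the request entirely when there are no dataset IDs to send.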