What is a SaaS configuration?
A SaaS connector is defined in two parts: the Dataset, and the SaaS configuration. The Dataset describes the data that is available from the connector, and the SaaS config describes how to connect to, and retrieve or update the data in the connector.
When accessing data from APIs, each application (and different endpoints within the same application) can follow different patterns, making their requirements different from Database connectors. Fides provides a flexible configuration to define different access/update patterns.
An example SaaS config
This guide will use a SaaS configuration to connect to Mailchimp. The configuration schema defines:
- The domain and authentication requirements for an HTTP client to access Mailchimp
- A test request for verifying the integration was set up correctly
- Endpoints to the following resources within the Mailchimp API:
GET
andPUT
for the members (opens in a new tab) resourceGET
for the conversations (opens in a new tab) resourceGET
for the messages (opens in a new tab) resource
The following is an example SaaS config for Mailchimp:
saas_config:
fides_key: mailchimp_connector_example
name: Mailchimp SaaS Config
type: mailchimp
description: A sample schema representing the Mailchimp connector for fides
version: 0.0.1
connector_params:
- name: domain
- name: username
- name: api_key
client_config:
protocol: https
host: <domain>
authentication:
strategy: basic
configuration:
username: <username>
password: <api_key>
rate_limit_config:
limits:
- rate: 10
period: second
test_request:
method: GET
path: /3.0/lists
endpoints:
- name: messages
requests:
read:
method: GET
path: /3.0/conversations/<conversation_id>/messages
param_values:
- name: conversation_id
references:
- dataset: mailchimp_connector_example
field: conversations.id
direction: from
data_path: conversation_messages
postprocessors:
- strategy: filter
configuration:
field: from_email
value:
identity: email
rate_limit_config:
enabled: false
- name: conversations
requests:
read:
method: GET
path: /3.0/conversations
query_params:
- name: count
value: 1000
- name: offset
value: 0
param_values:
- name: placeholder
identity: email
data_path: conversations
pagination:
strategy: offset
configuration:
incremental_param: offset
increment_by: 1000
limit: 10000
- name: member
requests:
read:
method: GET
path: /3.0/search-members
query_params:
- name: query
value: <email>
param_values:
- name: email
identity: email
data_path: exact_matches.members
update:
method: PUT
path: /3.0/lists/<list_id>/members/<subscriber_hash>
param_values:
- name: list_id
references:
- dataset: mailchimp_connector_example
field: member.list_id
direction: from
- name: subscriber_hash
references:
- dataset: mailchimp_connector_example
field: member.id
direction: from
SaaS config metadata fields
Attribute | Description |
---|---|
fides_key | Used to uniquely identify the connector, and to link a SaaS config to a dataset. |
name | A human-readable name for the connector. |
type | Type of SaaS connector. Choose from hubspot , mailchimp , or the other available connectors, or use custom for other types. |
description | Used to add a useful description. |
version | Used to track different versions of the SaaS config. |
The above configuration also contains the following complex fields
connector_params
client_config
rate_limit_config
test_request
endpoints
data_protection_request
Integration params
The connector_params
field is used to describe a list of settings which a user must configure as part of the setup. A default_value
can also be used to include values such as a standard base domain for an API or a recommended page size for pagination. Make sure to not include confidential values such as passwords or API keys, these values are added as part of the Connection secrets. When configuring a connector's secrets for the first time, the default values will be used if a value is not provided.
connector_params:
- name: domain
default_value: api.stripe.com
- name: username
- name: password
- name: page_size
default_value: 100
External references
The external_references
field is used to describe a list of external dataset references that a user must configure as part of the setup. These dataset references point to fields in a registered Fides dataset that are used to provide necessary values to the SaaS integration execution. The location of these fields depends on the use-case, so they cannot be determined beforehand, and therefore must be configured at the time of setting up the SaaS connector.
external_references:
- name: doordash_delivery_id
description: An external reference to a field/column in some database table that provides Doordash delivery IDs. This reference must be configured.
Client config
The client_config
field describes the necessary information to be able to create a base HTTP client. The values for host, username, and password are not defined here, only referenced in the form of a connector_param
which Fides uses to insert the actual value from the stored secrets.
client_config:
protocol: https
host: <host>
authentication:
strategy: basic
configuration:
username: <username>
password: <password>
The authentication strategies are swappable. This example uses the basic
authentication strategy, which takes a username
and password
in the configuration. An alternative to this is to use bearer
authentication which looks like this:
authentication:
strategy: bearer
configuration:
token: <api_key>
Fides also supports OAuth2 authentication.
Rate limits
The rate_limit_config
field describes rate limits which can be used to limit how fast requests can be made to an endpoint. They can also be configured at the endpoint request level.
rate_limit_config:
enabled: true
limits:
- rate: <rate>
period: <period>
Rate limiter supports second
, minute
, hour
and day
periods.
Test request
Once the base client is defined, use a test_request
to verify the hostname and credentials. This is in the form of an idempotent request (usually a read
request). The testing approach is the same for any connection.
test_request:
method: GET
path: /3.0/lists
Data protection request
If your third party integration supports something like a GDPR delete endpoint, that can be configured as a data_protection_request
. It has similar attributes to endpoint requests, but is generally one endpoint that removes all user information in one call.
data_protection_request:
method: POST
path: /v1beta/workspaces/<workspace_name>/regulations
param_values:
- name: workspace_name
connector_param: workspace
- name: user_id
identity: email
body: '{"regulation_type": "Suppress_With_Delete", "attributes": {"name": "userId", "values": ["<user_id>",]}}'
client_config:
protocol: https
host: <config_domain>
authentication:
strategy: bearer
configuration:
username: <access_token>
Endpoints
The endpoints configuration defines how collections are accessed and updated. The endpoint section contains the following members:
Attribute | Description |
---|---|
name | This name corresponds to a collection in the corresponding Dataset. |
after | To configure if this endpoint should run after other endpoints or collections. This should be a list of collection addresses. For example, after: [ mailchimp_connector_example.member ] would cause the current endpoint to run after the member endpoint. |
requests | A map of read , update , and delete requests for this collection. Each collection can define a way to read and a way to update the data. |
The requests
configuration further contains the following fields:
Attribute | Description |
---|---|
method | The HTTP method used for the endpoint. |
path | A static or dynamic resource path. The dynamic portions of the path are enclosed within angle brackets <dynamic_value> and are replaced with values from param_values . |
headers and query_params | The HTTP headers and query parameters to include in the request. |
headers.name , query_params.name | The value to use for the header or query param name. |
headers.value , query_params.value | This can be a static value, one or more of <dynamic_value> , or a mix of static and dynamic values (prefix <value> ) which will be replaced with the value sourced from the param_value with a matching name. |
body | Optional. A static or dynamic request body, with dynamic portions enclosed in brackets, just like path . These dynamic values will be replaced with values from param_values . |
param_values.name | Used as the key to reference this value from dynamic values in the path, headers, query, or body params. |
param_values.references | These are the same as references in a Dataset. It is used to define the source of the value for the given param_value . |
param_values.identity | Used to access the identity values passed into the privacy request such as email or phone number. |
param_values.connector_param | Used to access the user-configured secrets for the connection. |
ignore_errors | A boolean. If true, we will ignore non-200 status codes. |
data_path | The expression used to access the collection information from the raw JSON response. |
postprocessors | An optional list of response post-processing strategies. We will ignore this for the example scenarios below but an in depth-explanation can be found under SaaS Post-Processors. |
pagination | An optional strategy used to get the next set of results from APIs with resources spanning multiple pages. Details can be found under SaaS Pagination. |
grouped_inputs | An optional list of reference fields whose inputs are dependent upon one another. For example, an endpoint may need both an organization_id and a project_id from another endpoint. These aren't independent values, as a project_id belongs to an organization_id . You would specify this as ["organization_id", "project_id"] . |
client_config | Specify optional embedded Client Configs if an individual request needs a different protocol, host, or authentication strategy from the base Client Config. |
rate_limit_config | An optional rate limit configuration. Can be used to set rate limits or disable limiting for a specific endpoint. |
Param_values in more detail
The param_values
list is what provides the values to the various placeholders in the path, headers, query params and body. Values can be identities
, such as email or phone number, references
to fields in other collections or to an external_reference
, or connector_params
which are defined as part of configuring a SaaS connector.
Whenever a placeholder is encountered, the placeholder name is looked up in the list of param_values
and the corresponding value is used instead. Here is an example of placeholders being used in various locations
messages:
requests:
read:
method: GET
path: /<version>/messages
headers:
- name: Content-Type
value: application/json
- name: On-Behalf-Of
value: <email>
- name: Token
value: Custom <api_key>
query_params:
- name: count
value: 100
- name: organization:
value: <org_id>
- name: where:
value: properties["$email"]=="<email>"
param_values:
- name: email
identity: email
- name: api_key
connector_param: api_key
- name: org_id
connector_param: org_id
- name: version
connector_param: version
Generating requests
The following HTTP request properties are generated for each request based on the endpoint configuration:
- method
- path
- headers
- query params
- body
Method: This is a required field since a read, update, or delete endpoint might use any of the HTTP methods to perform the given action.
Path: This can be a static value or use placeholders. If the placeholders to build the path are not found at request-time, the request will fail.
Headers and query params: These can also be static or use placeholders. If a placeholder is missing, the request will continue and omit the given header or query param in the request
If reference values are used for the placeholders, each value will be processed independently unless the grouped_inputs
field is set. The following examples use query params but this applies to headers as well.
With ungrouped inputs (default)
read:
method: GET
path: /v1/disputes
query_params:
- name: charge
value: <charge_id>
- name: line_item
value: <line_item_id>
param_values:
- name: charge_id
references:
- dataset: connector_example
field: charge.id
direction: from
- name: line_item_id
references:
- dataset: connector_example
field: charge.line_item.id
direction: from
GET /v1/disputes?charge=1
GET /v1/disputes?charge=2
GET /v1/disputes?charge=3
GET /v1/disputes?line_item=a
GET /v1/disputes?line_item=b
GET /v1/disputes?line_item=c
With grouped inputs
read:
method: GET
path: /v1/disputes
grouped_inputs: [charge_id, payment_intent_id]
query_params:
- name: charge
value: <charge_id>
- name: line_item
value: <line_item_id>
param_values:
- name: charge_id
references:
- dataset: connector_example
field: charge.id
direction: from
- name: line_item_id
references:
- dataset: connector_example
field: charge.line_item.id
direction: from
GET /v1/disputes?charge=1&line_item=a
GET /v1/disputes?charge=2&line_item=b
GET /v1/disputes?charge=3&line_item=c
Body: The body can be static or use placeholders. If the placeholders to build the body are not found at request-time, the request will fail.
The following placeholders can be included in the body of an update:
<masked_object_fields>
- any masked fields, along with their masked value<all_object_fields>
- all object fields, including the masked fields and values
Fides will automatically fill in the value of these placeholders with the appropriate contents.
For example, an access request returned the following row
{
"id": 123,
"name": "Bobby Hill",
"address": "Arlen TX"
}
With the name
field masked, the value of each placeholder would be:
Placeholder | Value |
---|---|
<masked_object_fields> | "name":"MASKED" |
<all_object_fields> | "id":123,"name":"MASKED","address":"Arlen TX" |
all_object_fields
should be used if non-masked fields are required as part of the update payload.Read-Only fields: A field can be flagged as read-only
in the dataset to exclude it from the value of <all_object_fields>
(for example, if including the id
would cause an error).
- name: id
data_categories: [system.operations]
fidesops_meta:
read_only: True
This would result in the following change, with id
removed from the result:
Placeholder | Value |
---|---|
<all_object_fields> | "name":"MASKED","address":"Arlen TX" |
Example scenarios
Dynamic path with dataset references
endpoints:
- name: messages
requests:
read:
method: GET
path: /3.0/conversations/<conversation_id>/messages
param_values:
- name: conversation_id
references:
- dataset: mailchimp_connector_example
field: conversations.id
direction: from
In this example, /3.0/conversations/<conversation_id>/messages
is defined as the resource path for messages, and the path param of conversation_id
is defined as coming from the id
field of the conversations
collection. A separate GET HTTP request will be issued for each conversations.id
value.
# For three conversations with IDs of 1,2,3
GET /3.0/conversations/1/messages
GET /3.0/conversations/2/messages
GET /3.0/conversations/2/messages
Identity as a query param
endpoints:
- name: member
requests:
read:
method: GET
path: /3.0/search-members
query_params:
- name: query
value: <email>
param_values:
- name: email
identity: email
In this example, the placeholder in the query
query param will be replaced with the value of the param_value
with a name of email
, which is the email
identity. The result would look like this:
GET /3.0/search-members?query=name@email.com
Data update with a dynamic path
endpoints:
- name: member
requests:
update:
method: PUT
path: /3.0/lists/<list_id>/members/<subscriber_hash>
param_values:
- name: list_id
references:
- dataset: mailchimp_connector_example
field: member.list_id
direction: from
- name: subscriber_hash
references:
- dataset: mailchimp_connector_example
field: member.id
direction: from
This example uses two dynamic path variables, one from member.id
and one from member.list_id
. Since both of these are references to the member
collection, first issue a data retrieval (which will happen automatically if the read
request is defined). If a call to GET /3.0/search-members
returned the following member
object:
{
"list_id": "123",
"id": "456",
"merge_fields": {
"FNAME": "First",
"LNAME": "Last"
}
}
Then the update request would be:
PUT /3.0/lists/123/members/456
{
"list_id": "123",
"id": "456",
"merge_fields": {
"FNAME": "MASKED",
"LNAME": "MASKED"
}
}
and the contents of the body would be masked according to the configured policy.
Data update with a dynamic HTTP body
Sometimes, the update request needs a different body structure than what is obtained from the read request. In this example, use a custom HTTP body that contains the masked object fields.
update:
method: PUT
path: /crm/v3/objects/contacts
body: '{
"properties": {
<masked_object_fields>,
"user_ref_id": <user_ref_id>
}
}'
param_values:
- name: user_ref_id
references:
- dataset: dataset_test
field: contacts.user_ref_id
direction: from
Fides will replace the <masked_object_fields>
placeholder with the result of the policy-driven masking service (e.g. 'company': None, 'email': None
). Note that neither enclosing curly brackets ({
}
) nor a trailing comma (,
) are included as part of the replacement string.
This results in the following update request:
PUT /crm/v3/objects/contacts
{
"properties": {
"company": "None",
"email": "None"
"user_ref_id": "p983u4ncp3q8u4r"
}
}
How does this relate to graph traversal?
Fides uses the available Datasets to generate a graph of all reachable data and the dependencies between Datasets. For SaaS connectors, all the references and identities are stored in the param_values
, and must merge both the SaaS config and Dataset to provide a complete picture for the graph traversal. Using Mailchimp as an example, the Dataset collection and SaaS config endpoints for messages
looks like this:
collections:
- name: messages
fields:
- name: id
data_categories: [system.operations]
- name: conversation_id
data_categories: [system.operations]
- name: from_label
data_categories: [system.operations]
- name: from_email
data_categories: [user.contact.email]
- name: subject
data_categories: [system.operations]
- name: message
data_categories: [system.operations]
- name: read
data_categories: [system.operations]
- name: timestamp
data_categories: [system.operations]
endpoints:
- name: messages
requests:
read:
method: GET
path: /3.0/conversations/<conversation_id>/messages
param_values:
- name: conversation_id
references:
- dataset: mailchimp_connector_example
field: conversations.id
direction: from
postprocessors:
- strategy: unwrap
configuration:
data_path: conversation_messages
- strategy: filter
configuration:
field: from_email
value:
identity: email
An example of the augmented Dataset with the SaaS Config references would look like this:
collections:
- name: messages
fields:
- name: id
data_categories: [system.operations]
- name: conversation_id
data_categories: [system.operations]
fidesops_meta:
references:
- dataset: mailchimp_connector_example
field: conversations.id
direction: from
- name: from_label
data_categories: [system.operations]
- name: from_email
data_categories: [user.contact.email]
- name: subject
data_categories: [system.operations]
- name: message
data_categories: [system.operations]
- name: read
data_categories: [system.operations]
- name: timestamp
data_categories: [system.operations]
The conversation_id
field is updated with a reference from mailchimp_connector_example.conversations.id
. This means that the conversations
collection must be retrieved first, then forward the conversation IDs to the messages collection for further processing.
What if a collection has no dependencies?
In the Mailchimp example, there is a placeholder
request param
endpoints:
- name: conversations
requests:
read:
method: GET
path: /3.0/conversations
query_params:
- name: count
value: 1000
- name: offset
value: 0
param_values:
- name: placeholder
identity: email
Some endpoints might not have any external dependencies on identity
or Dataset reference
values. The way the Fides graph traversal interprets this is as an unreachable collection. At this time, the way to mark this as reachable is to include a param_value
with an identity or a reference.
In the future, collections like these will still be considered reachable even without this placeholder.