What is a SaaS configuration?

A SaaS connector is defined in two parts: the Dataset, and the SaaS configuration. The Dataset describes the data that is available from the connector, and the SaaS config describes how to connect to, and retrieve or update the data in the connector.

When accessing data from APIs, each application (and different endpoints within the same application) can follow different patterns, making their requirements different from Database connectors. Fides provides a flexible configuration to define different access/update patterns.

An example SaaS config

This guide will use a SaaS configuration to connect to Mailchimp. The configuration schema defines:

The domain and authentication requirements for an HTTP client to access Mailchimp
A test request for verifying the integration was set up correctly
Endpoints to the following resources within the Mailchimp API:
- GET and PUT for the members (opens in a new tab) resource
- GET for the conversations (opens in a new tab) resource
- GET for the messages (opens in a new tab) resource

The following is an example SaaS config for Mailchimp:

saas_config:
  fides_key: mailchimp_connector_example
  name: Mailchimp SaaS Config
  type: mailchimp
  description: A sample schema representing the Mailchimp connector for fides
  version: 0.0.1
 
  connector_params:
    - name: domain
    - name: username
    - name: api_key
 
  client_config:
    protocol: https
    host: <domain>
    authentication:
      strategy: basic
      configuration:
        username: <username>
        password: <api_key>
 
  rate_limit_config:
    limits:
    - rate: 10
      period: second
 
  test_request:
    method: GET
    path: /3.0/lists
 
  endpoints:
  - name: messages
    requests:
      read:
        method: GET
        path: /3.0/conversations/<conversation_id>/messages
        param_values:
          - name: conversation_id
            references:
              - dataset: mailchimp_connector_example
                field: conversations.id
                direction: from
        data_path: conversation_messages
        postprocessors:
          - strategy: filter
            configuration:
              field: from_email
              value:
                identity: email
        rate_limit_config:
          enabled: false
  - name: conversations
    requests:
      read:
        method: GET
        path: /3.0/conversations
        query_params:
          - name: count
            value: 1000
          - name: offset
            value: 0
        param_values:
          - name: placeholder
            identity: email
        data_path: conversations
        pagination:
          strategy: offset
          configuration:
            incremental_param: offset
            increment_by: 1000
            limit: 10000
  - name: member
    requests:
      read:
        method: GET
        path: /3.0/search-members
        query_params:
          - name: query
            value: <email>
        param_values:
          - name: email
            identity: email
        data_path: exact_matches.members
      update:
        method: PUT
        path: /3.0/lists/<list_id>/members/<subscriber_hash>
        param_values:
          - name: list_id
            references:
              - dataset: mailchimp_connector_example
                field: member.list_id
                direction: from
          - name: subscriber_hash
            references:
              - dataset: mailchimp_connector_example
                field: member.id
                direction: from

SaaS config metadata fields

Attribute	Description
`fides_key`	Used to uniquely identify the connector, and to link a SaaS config to a dataset.
`name`	A human-readable name for the connector.
`type`	Type of SaaS connector. Choose from `hubspot`, `mailchimp`, or the other available connectors, or use `custom` for other types.
`description`	Used to add a useful description.
`version`	Used to track different versions of the SaaS config.

The above configuration also contains the following complex fields

connector_params
client_config
rate_limit_config
test_request
endpoints
data_protection_request

Integration params

The connector_params field is used to describe a list of settings which a user must configure as part of the setup. A default_value can also be used to include values such as a standard base domain for an API or a recommended page size for pagination. Make sure to not include confidential values such as passwords or API keys, these values are added as part of the Connection secrets. When configuring a connector's secrets for the first time, the default values will be used if a value is not provided.

connector_params:
  - name: domain
    default_value: api.stripe.com
  - name: username
  - name: password
  - name: page_size
    default_value: 100

External references

The external_references field is used to describe a list of external dataset references that a user must configure as part of the setup. These dataset references point to fields in a registered Fides dataset that are used to provide necessary values to the SaaS integration execution. The location of these fields depends on the use-case, so they cannot be determined beforehand, and therefore must be configured at the time of setting up the SaaS connector.

external_references:
    - name: doordash_delivery_id
      description: An external reference to a field/column in some database table that provides Doordash delivery IDs. This reference must be configured.

Client config

The client_config field describes the necessary information to be able to create a base HTTP client. The values for host, username, and password are not defined here, only referenced in the form of a connector_param which Fides uses to insert the actual value from the stored secrets.

client_config:
  protocol: https
  host: <host>
  authentication:
    strategy: basic
    configuration:
      username: <username>
      password: <password>

The authentication strategies are swappable. This example uses the basic authentication strategy, which takes a username and password in the configuration. An alternative to this is to use bearer authentication which looks like this:

authentication:
  strategy: bearer
  configuration:
    token: <api_key>

Fides also supports OAuth2 authentication.

Rate limits

The rate_limit_config field describes rate limits which can be used to limit how fast requests can be made to an endpoint. They can also be configured at the endpoint request level.

rate_limit_config:
  enabled: true
  limits:
  - rate: <rate>
    period: <period>

Rate limiter supports second, minute, hour and dayperiods.

Test request

Once the base client is defined, use a test_request to verify the hostname and credentials. This is in the form of an idempotent request (usually a read request). The testing approach is the same for any connection.

test_request:
  method: GET
  path: /3.0/lists

Data protection request

If your third party integration supports something like a GDPR delete endpoint, that can be configured as a data_protection_request. It has similar attributes to endpoint requests, but is generally one endpoint that removes all user information in one call.

data_protection_request:
  method: POST
  path: /v1beta/workspaces/<workspace_name>/regulations
  param_values:
    - name: workspace_name
      connector_param: workspace
    - name: user_id
      identity: email
  body: '{"regulation_type": "Suppress_With_Delete", "attributes": {"name": "userId", "values": ["<user_id>",]}}'
  client_config:
    protocol: https
    host: <config_domain>
    authentication:
      strategy: bearer
      configuration:
        username: <access_token>

Endpoints

The endpoints configuration defines how collections are accessed and updated. The endpoint section contains the following members:

Attribute	Description
`name`	This name corresponds to a collection in the corresponding Dataset.
`after`	To configure if this endpoint should run after other endpoints or collections. This should be a list of collection addresses. For example, `after: [ mailchimp_connector_example.member ]` would cause the current endpoint to run after the `member` endpoint.
`requests`	A map of `read`, `update`, and `delete` requests for this collection. Each collection can define a way to read and a way to update the data.

The requests configuration further contains the following fields:

Attribute	Description
`method`	The HTTP method used for the endpoint.
`path`	A static or dynamic resource path. The dynamic portions of the path are enclosed within angle brackets `<dynamic_value>` and are replaced with values from `param_values`.
`headers` and `query_params`	The HTTP headers and query parameters to include in the request.
`headers.name`, `query_params.name`	The value to use for the header or query param name.
`headers.value`, `query_params.value`	This can be a static value, one or more of `<dynamic_value>`, or a mix of static and dynamic values (prefix `<value>`) which will be replaced with the value sourced from the `param_value` with a matching name.
`body`	Optional. A static or dynamic request body, with dynamic portions enclosed in brackets, just like `path`. These dynamic values will be replaced with values from `param_values`.
`param_values.name`	Used as the key to reference this value from dynamic values in the path, headers, query, or body params.
`param_values.references`	These are the same as `references` in a Dataset. It is used to define the source of the value for the given `param_value`.
`param_values.identity`	Used to access the identity values passed into the privacy request such as email or phone number.
`param_values.connector_param`	Used to access the user-configured secrets for the connection.
`ignore_errors`	A boolean. If true, we will ignore non-200 status codes.
`data_path`	The expression used to access the collection information from the raw JSON response.
`postprocessors`	An optional list of response post-processing strategies. We will ignore this for the example scenarios below but an in depth-explanation can be found under SaaS Post-Processors.
`pagination`	An optional strategy used to get the next set of results from APIs with resources spanning multiple pages. Details can be found under SaaS Pagination.
`grouped_inputs`	An optional list of reference fields whose inputs are dependent upon one another. For example, an endpoint may need both an `organization_id` and a `project_id` from another endpoint. These aren't independent values, as a `project_id` belongs to an `organization_id`. You would specify this as `["organization_id", "project_id"]`.
`client_config`	Specify optional embedded Client Configs if an individual request needs a different protocol, host, or authentication strategy from the base Client Config.
`rate_limit_config`	An optional rate limit configuration. Can be used to set rate limits or disable limiting for a specific endpoint.

Param_values in more detail

The param_values list is what provides the values to the various placeholders in the path, headers, query params and body. Values can be identities, such as email or phone number, references to fields in other collections or to an external_reference, or connector_params which are defined as part of configuring a SaaS connector.

Whenever a placeholder is encountered, the placeholder name is looked up in the list of param_values and the corresponding value is used instead. Here is an example of placeholders being used in various locations

messages:
  requests:
    read:
      method: GET
      path: /<version>/messages
      headers:
        - name: Content-Type
          value: application/json
        - name: On-Behalf-Of
          value: <email>
        - name: Token
          value: Custom <api_key>
      query_params:
        - name: count
          value: 100
        - name: organization:
          value: <org_id>
        - name: where:
          value: properties["$email"]=="<email>"
      param_values:
        - name: email
          identity: email
        - name: api_key
          connector_param: api_key
        - name: org_id
          connector_param: org_id
        - name: version
          connector_param: version

Generating requests

The following HTTP request properties are generated for each request based on the endpoint configuration:

method
path
headers
query params
body

Method: This is a required field since a read, update, or delete endpoint might use any of the HTTP methods to perform the given action.

Path: This can be a static value or use placeholders. If the placeholders to build the path are not found at request-time, the request will fail.

Headers and query params: These can also be static or use placeholders. If a placeholder is missing, the request will continue and omit the given header or query param in the request

If reference values are used for the placeholders, each value will be processed independently unless the grouped_inputs field is set. The following examples use query params but this applies to headers as well.

With ungrouped inputs (default)

read:
  method: GET
  path: /v1/disputes
  query_params:
    - name: charge
      value: <charge_id>
    - name: line_item
      value: <line_item_id>
  param_values:
  - name: charge_id
    references:
      - dataset: connector_example
        field: charge.id
        direction: from
  - name: line_item_id
    references:
      - dataset: connector_example
        field: charge.line_item.id
        direction: from

GET /v1/disputes?charge=1
GET /v1/disputes?charge=2
GET /v1/disputes?charge=3
GET /v1/disputes?line_item=a
GET /v1/disputes?line_item=b
GET /v1/disputes?line_item=c

With grouped inputs

read:
  method: GET
  path: /v1/disputes
  grouped_inputs: [charge_id, payment_intent_id]
  query_params:
    - name: charge
      value: <charge_id>
    - name: line_item
      value: <line_item_id>
  param_values:
  - name: charge_id
    references:
      - dataset: connector_example
        field: charge.id
        direction: from
  - name: line_item_id
    references:
      - dataset: connector_example
        field: charge.line_item.id
        direction: from

GET /v1/disputes?charge=1&line_item=a
GET /v1/disputes?charge=2&line_item=b
GET /v1/disputes?charge=3&line_item=c

Body: The body can be static or use placeholders. If the placeholders to build the body are not found at request-time, the request will fail.

The following placeholders can be included in the body of an update:

<masked_object_fields> - any masked fields, along with their masked value
<all_object_fields> - all object fields, including the masked fields and values

Fides will automatically fill in the value of these placeholders with the appropriate contents.

For example, an access request returned the following row

{
  "id": 123,
  "name": "Bobby Hill",
  "address": "Arlen TX"
}

With the name field masked, the value of each placeholder would be:

Placeholder	Value
`<masked_object_fields>`	`"name":"MASKED"`
`<all_object_fields>`	`"id":123,"name":"MASKED","address":"Arlen TX"`

💡

all_object_fields should be used if non-masked fields are required as part of the update payload.

Read-Only fields: A field can be flagged as read-only in the dataset to exclude it from the value of <all_object_fields> (for example, if including the id would cause an error).

- name: id
  data_categories: [system.operations]
  fides_meta:
    read_only: True

This would result in the following change, with id removed from the result:

Placeholder	Value
`<all_object_fields>`	`"name":"MASKED","address":"Arlen TX"`

Example scenarios

Dynamic path with dataset references

endpoints:
  - name: messages
    requests:
      read:
        method: GET
        path: /3.0/conversations/<conversation_id>/messages
        param_values:
          - name: conversation_id
            references:
              - dataset: mailchimp_connector_example
                field: conversations.id
                direction: from

In this example, /3.0/conversations/<conversation_id>/messages is defined as the resource path for messages, and the path param of conversation_id is defined as coming from the id field of the conversations collection. A separate GET HTTP request will be issued for each conversations.id value.

# For three conversations with IDs of 1,2,3
GET /3.0/conversations/1/messages
GET /3.0/conversations/2/messages
GET /3.0/conversations/2/messages

Identity as a query param

endpoints:
  - name: member
    requests:
      read:
        method: GET
        path: /3.0/search-members
        query_params:
          - name: query
            value: <email>
        param_values:
          - name: email
            identity: email

In this example, the placeholder in the query query param will be replaced with the value of the param_value with a name of email, which is the email identity. The result would look like this:

GET /3.0/search-members?query=name@email.com

Data update with a dynamic path

endpoints:
  - name: member
    requests:
      update:
        method: PUT
        path: /3.0/lists/<list_id>/members/<subscriber_hash>
        param_values:
          - name: list_id
            references:
              - dataset: mailchimp_connector_example
                field: member.list_id
                direction: from
          - name: subscriber_hash
            references:
              - dataset: mailchimp_connector_example
                field: member.id
                direction: from

This example uses two dynamic path variables, one from member.id and one from member.list_id. Since both of these are references to the member collection, first issue a data retrieval (which will happen automatically if the read request is defined). If a call to GET /3.0/search-members returned the following member object:

{
    "list_id": "123",
    "id": "456",
    "merge_fields": {
      "FNAME": "First",
      "LNAME": "Last"
    }
}

Then the update request would be:

PUT /3.0/lists/123/members/456
 
{
    "list_id": "123",
    "id": "456",
    "merge_fields": {
      "FNAME": "MASKED",
      "LNAME": "MASKED"
    }
}

and the contents of the body would be masked according to the configured policy.

Data update with a dynamic HTTP body

Sometimes, the update request needs a different body structure than what is obtained from the read request. In this example, use a custom HTTP body that contains the masked object fields.

update:
  method: PUT
  path: /crm/v3/objects/contacts
  body: '{
    "properties": {
      <masked_object_fields>,
      "user_ref_id": <user_ref_id>            
    }
  }'
  param_values:
    - name: user_ref_id
      references:
        - dataset: dataset_test
          field: contacts.user_ref_id
          direction: from

Fides will replace the <masked_object_fields> placeholder with the result of the policy-driven masking service (e.g. 'company': None, 'email': None). Note that neither enclosing curly brackets ({ }) nor a trailing comma (,) are included as part of the replacement string.

This results in the following update request:

PUT /crm/v3/objects/contacts
 
{
  "properties": {
      "company": "None",
      "email": "None"
      "user_ref_id": "p983u4ncp3q8u4r"
  }
}

How does this relate to graph traversal?

Fides uses the available Datasets to generate a graph of all reachable data and the dependencies between Datasets. For SaaS connectors, all the references and identities are stored in the param_values, and must merge both the SaaS config and Dataset to provide a complete picture for the graph traversal. Using Mailchimp as an example, the Dataset collection and SaaS config endpoints for messages looks like this:

collections:
  - name: messages
    fields:
      - name: id
        data_categories: [system.operations]
      - name: conversation_id
        data_categories: [system.operations]
      - name: from_label
        data_categories: [system.operations]
      - name: from_email
        data_categories: [user.contact.email]
      - name: subject
        data_categories: [system.operations]
      - name: message
        data_categories: [system.operations]
      - name: read
        data_categories: [system.operations]
      - name: timestamp
        data_categories: [system.operations]

endpoints:
  - name: messages
    requests:
      read:
        method: GET
        path: /3.0/conversations/<conversation_id>/messages
        param_values:
          - name: conversation_id
            references:
              - dataset: mailchimp_connector_example
                field: conversations.id
                direction: from
        postprocessors:
          - strategy: unwrap
            configuration:
              data_path: conversation_messages
          - strategy: filter
            configuration:
              field: from_email
              value:
                identity: email

An example of the augmented Dataset with the SaaS Config references would look like this:

collections:
  - name: messages
    fields:
      - name: id
        data_categories: [system.operations]
      - name: conversation_id
        data_categories: [system.operations]
        fides_meta:
          references:
            - dataset: mailchimp_connector_example
              field: conversations.id
              direction: from
      - name: from_label
        data_categories: [system.operations]
      - name: from_email
        data_categories: [user.contact.email]
      - name: subject
        data_categories: [system.operations]
      - name: message
        data_categories: [system.operations]
      - name: read
        data_categories: [system.operations]
      - name: timestamp
        data_categories: [system.operations]

The conversation_id field is updated with a reference from mailchimp_connector_example.conversations.id. This means that the conversations collection must be retrieved first, then forward the conversation IDs to the messages collection for further processing.

What if a collection has no dependencies?

In the Mailchimp example, there is a placeholder request param

endpoints:
  - name: conversations
    requests:
      read:
        method: GET
        path: /3.0/conversations
        query_params:
          - name: count
            value: 1000
          - name: offset
            value: 0
        param_values:
          - name: placeholder
            identity: email

Some endpoints might not have any external dependencies on identity or Dataset reference values. The way the Fides graph traversal interprets this is as an unreachable collection. At this time, the way to mark this as reachable is to include a param_value with an identity or a reference.

In the future, collections like these will still be considered reachable even without this placeholder.

Third-Party Integrations OAuth2 Configuration