Understanding Clustering

For each data product, Tamr Cloud uses both a trained machine-learning model and rules to identify and cluster source records that refer to the same real-world entity. For each cluster, the flow creates a single entity (a company, contact, patient, and so on) with the most appropriate values from the clustered records.

The model and rules are applied in the Apply Clustering Model step in the flow. Open the step to see the list of specific attributes used by the machine-learning model for the data product and the list of deterministic rules that are used to cluster records. This step:

  1. Uses the trained model to cluster records based on similarity of key attributes.
  2. Applies any rules to merge or split those clusters, based on matching or non-matching values for specific attributes.
  3. Finally, applies any source record overrides you have made as part of curation. See Adjusting Mastering Results to learn more about overrides.

The example below shows the machine learning attributes and clustering rules for B2B customers data products.

Apply Clustering step in B2B Customers flow

Apply Clustering step in B2B Customers flow

Clustering Model

First, similar records are clustered together by the trained machine learning model.

Models are data product specific, and are trained to accurately identify similar companies, contacts, patients, and so on. Each model considers the similarities of values for relevant attributes for the data product to determine which records should be clustered together.

For example, the B2B Customers data product, which masters and enriches company data, considers the similarities in these attributes to determine if source records represent the same company:

  • Company name
  • Alternate company names
  • Full address and address components
  • Phone number
  • Website

To learn more about the clustering model for your data product, refer to the data product documentation.

Clustering Rules

After the records have been clustered by the model, clustering rules are applied to refine the results. Clustering rules deterministically identify records that should or should not be clustered together based on values in specific attributes that reliably indicate unique entities.

Types of Rules

A positive rule merges clusters with matching values for a specified attribute, such as a trusted_id. Positive rules will not merge clusters that contain only null or empty values for the specified attribute.

A negative rule splits clusters that contain records with different values for a specific attribute. The rule splits the cluster so that each new cluster contains records with matching values for the attribute.

Rule Priority

Most data products include multiple rules, which are listed in descending order of priority. Rules with higher priority will take precedence over other rules if there is a conflict in the clustering logic.

Clustering Rule Examples

Most data product templates include clustering rules for trusted_id values. A trusted ID may be a Social Security Number, a patient identification number, an internal company identifier, and so on. In these data products, the rules for trusted_id are as follows:

  1. Positive: Cluster records with matching trusted_id values together.
  2. Negative: Put records with different, non-null trusted_id into different clusters.

As another example, the Healthcare Provider data product uses both the trusted_id rules above, and applies another, lower priority, rule that prevents healthcare providers who have different middle names from being clustered together.

Consider the three rules in the Healthcare Provider data product, that are listed in this priority order:

  1. trusted_id: Records with matching values are clustered together.
  2. trusted_id: Records with non-matching non-null values are put into different clusters.
  3. ml_middle_name: Records with non-matching non-null values are put into different clusters.

Here is how these rules are applied to a healthcare provider cluster:

  1. First, Tamr splits any records with different, non-null middle name values into different clusters.
  2. Then, Tamr splits any records in a cluster with non-matching, non-null values into different clusters.
  3. Finally, Tamr merges any clusters where records have the same trusted-id value.

This means that a record with a different ml_middle_name value but the same trusted_id value will end up being clustered together, because the rule to merge clusters with the same trusted_id has a higher priority than the rule to split on middle name.

Based on these rules, the records in the table below are grouped into 3 clusters as follows:

  • Cluster A: Record 1, 2, and 3. These records are clustered together because:
    • Records 1 and 2 have the same trusted_id. Although the ml_middle_name values are different between the records, the trusted_id rule takes priority.
    • Since Record 3 has a blank trusted_id, it is included with the records with the most common trusted_id within the cluster.
  • Cluster B: Record 4. This record is put into its own cluster because:
    • It has a different trusted_id value than Records 1 and 2 and therefore is not included in Cluster A.
    • It has a different middle name than Record 5 and therefore cannot be clustered with Record 5 despite high similarity in other attribute values.
  • Cluster C: Record 5. This record is put into its own cluster because it has a different middle name value than the other similar records.
Cluster A A A B C
Attribute Record 1 Record 2 Record 3 Record 4 Record 5
trusted_id ab1cd2 ab1cd2 blank ef3hi4 blank
address_line_1 123 Main Street 3 Spruce Ave. 123 Main Street Main St. 123 Main Street
city Springfield blank Springfield Springfield Springfield
first_name Christopher Chris Christopher Chris Chris
last_name Rogers Rogers Rogers Rodgers Rodgers
middle_name Adam Arthur Adam Adam Brian
provider_specialty Internal Medicine Internal Medicine Internal Medicine Internal Medicine Internal Medicine
region OH MA OH OH OH

To learn how to check whether clustering rules have been applied to your source records, see About Persistent Identifier Fields.