Understanding Clustering
For each data product, Tamr Cloud uses both a trained machine-learning model and rules to identify and cluster source records that refer to the same real-world entity. For each cluster, the flow creates a single entity (a company, contact, patient, and so on) with the most appropriate values from the clustered records.
The model and rules are applied in the Apply Clustering Model step in the flow. Open the step to see the list of specific attributes used by the machine-learning model for the data product and the list of deterministic rules that are used to cluster records. This step:
- Uses the trained model to cluster records based on similarity of key attributes.
- Applies any rules to merge or split those clusters, based on matching or non-matching values for specific attributes.
- Finally, applies any source record overrides you have made as part of curation. See Adjusting Mastering Results to learn more about overrides.
The example below shows the machine learning attributes and clustering rules for B2B customers data products.
Clustering Model
First, similar records are clustered together by the trained machine learning model.
Models are data product specific, and are trained to accurately identify similar companies, contacts, patients, and so on. Each model considers the similarities of values for relevant attributes for the data product to determine which records should be clustered together.
For example, the B2B Customers data product, which masters and enriches company data, considers the similarities in these attributes to determine if source records represent the same company:
- Company name
- Alternate company names
- Full address and address components
- Phone number
- Website
To learn more about the clustering model for your data product, refer to the data product documentation.
Clustering Rules
After the records have been clustered by the model, clustering rules are applied to refine the results. Clustering rules deterministically identify records that should or should not be clustered together based on values in specific attributes that reliably indicate unique entities.
Types of Rules
A positive rule merges clusters with matching values for a specified attribute, such as a trusted_id
. Positive rules will not merge clusters that contain only null or empty values for the specified attribute.
A negative rule splits clusters that contain records with different values for a specific attribute. The rule splits the cluster so that each new cluster contains records with matching values for the attribute.
Rule Priority
Most data products include multiple rules, which are listed in descending order of priority. Rules with higher priority will take precedence over other rules if there is a conflict in the clustering logic.
Clustering Rule Examples
Most data product templates include clustering rules for trusted_id values. A trusted ID may be a Social Security Number, a patient identification number, an internal company identifier, and so on. In these data products, the rules for trusted_id
are as follows:
- Positive: Cluster records with matching trusted_id values together.
- Negative: Put records with different, non-null trusted_id into different clusters.
As another example, the Healthcare Provider data product uses both the trusted_id
rules above, and applies another, lower priority, rule that prevents healthcare providers who have different middle names from being clustered together.
Consider the three rules in the Healthcare Provider data product, that are listed in this priority order:
trusted_id
: Records with matching values are clustered together.trusted_id
: Records with non-matching non-null values are put into different clusters.ml_middle_name
: Records with non-matching non-null values are put into different clusters.
Here is how these rules are applied to a healthcare provider cluster:
- First, Tamr splits any records with different, non-null middle name values into different clusters.
- Then, Tamr splits any records in a cluster with non-matching, non-null values into different clusters.
- Finally, Tamr merges any clusters where records have the same trusted-id value.
This means that a record with a different ml_middle_name value
but the same trusted_id
value will end up being clustered together, because the rule to merge clusters with the same trusted_id
has a higher priority than the rule to split on middle name.
Based on these rules, the records in the table below are grouped into 3 clusters as follows:
- Cluster A: Record 1, 2, and 3. These records are clustered together because:
- Records 1 and 2 have the same
trusted_id
. Although theml_middle_name
values are different between the records, thetrusted_id
rule takes priority. - Since Record 3 has a blank
trusted_id
, it is included with the records with the most commontrusted_id
within the cluster.
- Records 1 and 2 have the same
- Cluster B: Record 4. This record is put into its own cluster because:
- It has a different
trusted_id
value than Records 1 and 2 and therefore is not included in Cluster A. - It has a different middle name than Record 5 and therefore cannot be clustered with Record 5 despite high similarity in other attribute values.
- It has a different
- Cluster C: Record 5. This record is put into its own cluster because it has a different middle name value than the other similar records.
Cluster | A | A | A | B | C |
---|---|---|---|---|---|
Attribute | Record 1 | Record 2 | Record 3 | Record 4 | Record 5 |
trusted_id | ab1cd2 | ab1cd2 | blank | ef3hi4 | blank |
address_line_1 | 123 Main Street | 3 Spruce Ave. | 123 Main Street | Main St. | 123 Main Street |
city | Springfield | blank | Springfield | Springfield | Springfield |
first_name | Christopher | Chris | Christopher | Chris | Chris |
last_name | Rogers | Rogers | Rogers | Rodgers | Rodgers |
middle_name | Adam | Arthur | Adam | Adam | Brian |
provider_specialty | Internal Medicine | Internal Medicine | Internal Medicine | Internal Medicine | Internal Medicine |
region | OH | MA | OH | OH | OH |
To learn how to check whether clustering rules have been applied to your source records, see About Persistent Identifier Fields.
Updated 3 months ago