Data Matching for Deduplication (MatchOneSource)

Matching and Merging Data in a Source (Enterworks, Salesforce, etc.)

Usually sources do not provide enough functionality to execute match-and-merge inside the source, so we move the data to a staging database, execute match-and-merge in the staging database, and then update the source with the result. The whole process is usually configured as two scenarios. The first scenario downloads data from the source using a Load or Transform operation, standardizes the data using the CheckStandardizeColumns operation (an optional but recommended step), and runs match-and-merge using the MatchOneSource operation. After executing the first scenario, the data can be manually reviewed in the Steward UI. The second scenario updates the data in the source: it removes the duplicate records and updates the records into which the removed records were merged. In our Tutorials application you can find an example of such scenarios for merging accounts in Salesforce: Dedup 1 - Run and Dedup 2 - Commit, respectively. The same scenarios can be reconfigured to run for Enterworks by setting the source and target sources to a source of the EnterworksSource type.
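
To make the commit step concrete, below is a rough SQL sketch of what the second scenario effectively does. It is an illustration only: in practice the changes are applied through the source's API (Salesforce, Enterworks), and the match_result table and its columns are hypothetical names.

    -- 1) Update the surviving (master) records with the merged field values
    update account a
    set    name                = r.merged_name,
           billing_postal_code = r.merged_postal_code
    from   match_result r
    where  a.id = r.master_id;

    -- 2) Remove the duplicate records that were merged into the masters
    delete from account
    where  id in (select duplicate_id from match_result);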

Configuring match-and-merge as a set of scenarios makes the process easy to customize. For example, if your matching rules require columns from multiple source tables (e.g. multiple Enterworks repositories), you can easily extend the first scenario by downloading multiple tables and joining them together before the standardization step. By modifying the scenarios you can also configure support for lineage or incremental data updates, as described below.

Using Data Standardization To Improve Matching

Use the CheckStandardizeColumns operation before the MatchOneSource operation to standardize the columns you are going to use for matching. Preliminary data standardization helps increase the match rate and reduces the need for expensive fuzzy matching functions. The most universal and commonly used standardization function is /string/format/normalize. It can be applied to any string column to remove extra spaces and convert the value to lower case. As a result, even exact matching rules (aka equal rules) in the MatchOneSource operation will match string values that differ only in extra spaces or character case. See CheckStandardizeColumns and Functions for more details.
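
As an illustration, the /string/format/normalize function described above behaves roughly like the SQL expression below (an approximation with hypothetical table and column names; the _st suffix for standardized columns follows the naming used later in this article):

    -- Approximate effect of /string/format/normalize: trim/collapse spaces
    -- and lower-case the value, so an exact (equal) rule on Name_st matches
    -- ' ACME  Corp ' and 'acme corp'.
    select
      Name,
      lower(regexp_replace(trim(Name), '\s+', ' ', 'g')) as Name_st
    from account_staging;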

Matching Rules

We suggest reading the following recommendations in the order they are listed, as each subsequent recommendation builds on the previous ones.

  • Reduce the use of fuzzy matching functions by combining data standardization (as described in Using Data Standardization To Improve Matching) with exact matching rules instead. Avoid fuzzy matching functions when possible: they are inherently expensive to compute and typically require more manual review.

  • Each rule must have at least one column with exact (aka equal) comparison. In the presence of a fuzzy matching function (such as edit_distance), the input table is first grouped by all equal columns specified in the rule, and then the fuzzy matching functions are applied to each possible pair inside each group (see the sketch after this list). This means fuzzy matching has quadratic complexity in the size of the group, and because fuzzy matching functions are themselves computationally expensive, applying them to each possible pair takes a lot of time. Without equal columns, fuzzy matching functions would have to be applied to every possible pair among all values of the column, leading to a very long execution time even for a table with a moderate number of records.

  • Use equal columns to reduce the size of the groups used for fuzzy matching functions. This improves performance significantly because large groups lead to a quadratic explosion in expensive fuzzy matching function computations. You can review the group sizes by running the following query:

    select <list-of-columns-with-equal-comparison>, count(*) as c from <input-table-name> group by <list-of-columns-with-equal-comparison> order by c desc

    In the UI you can also click the profile groups links above Match columns in the matching rule specification; they display the result of the above query. If you see at least one group with a hundred or more values, consider adding another equal column (or even several) to the rule to split the large groups.

  • A common technique for using equal columns to minimize group sizes is to create an additional column during the standardization phase that contains only a few leading (or trailing) characters of a column, and to add this additional column with equal comparison to the matching rule. To compensate for fuzzy matches missed in the leading characters, you can add another matching rule with the same columns, except that the leading-characters column is replaced with a trailing-characters column.

  • Another solution to the problem of large groups in fuzzy matching is to exclude them from matching by setting the Skip groups larger than parameter.

  • The most common reason for match execution taking too much time is a rule with fuzzy matching function(s) where the groups formed by equal columns are too large. See the recommendations above on using equal columns to reduce the size of the groups, or use the Skip groups larger than parameter.

  • NULL values are considered equal in matching. To exclude records with NULL values, set a Filter in the matching rule. The Filter can be any SQL expression, for example:

    [Name_st] is not null and [BillingPostalCode_st] is not null
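
To illustrate the points above about equal columns, group sizes, and leading-characters columns, here is a conceptual sketch in plain SQL. The table and column names are hypothetical, and the pairing itself is performed internally by the MatchOneSource operation; the sketch only shows why group sizes matter.

    -- A standardization step could derive a blocking column from the first
    -- few characters of the standardized name, e.g. substr(Name_st, 1, 4)
    -- as Name_head4.
    --
    -- For a rule with equal columns (BillingPostalCode_st, Name_head4) and
    -- a fuzzy column (Name_st), the equal columns form groups and the fuzzy
    -- function is evaluated for every pair inside a group, hence the
    -- quadratic cost on group size:
    select a.Id as id_a, b.Id as id_b
    from   customer_staging a
    join   customer_staging b
      on   a.BillingPostalCode_st = b.BillingPostalCode_st
     and   a.Name_head4           = b.Name_head4
     and   a.Id < b.Id            -- each unordered pair once
    -- edit_distance(a.Name_st, b.Name_st) is then computed for each pair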

Supporting Lineage

Lineage allows tracing master record values back to the source records and facilitates propagation of source updates into the master records. Let’s consider a lineage implementation for Enterworks (the described approach can be applied to other sources).

Create two repositories in Enterworks: Customer_Source and Customer. The Customer_Source repository stores a copy of the records from the external sources (e.g. multiple ERP systems) with a one-to-one relationship to the external source: each record in the Customer_Source repository corresponds to exactly one record in an external source and can be looked up by the external source record id. In the case of multiple external sources we usually create two fields, source_name and source_id, to link to the external source records. The Customer repository contains the master records. There is a many-to-one relationship between the Customer_Source and Customer repositories: the Customer_Source repository has a field, usually called master_id, that links the source record to one master record. On initial load, the Customer_Source and Customer repositories contain the same number of records with an actual one-to-one relationship. After running match-and-merge, some records in the master repository are merged and the corresponding records in the source repository are relinked to the merged master records, resulting in an actual many-to-one relationship.
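
For illustration only, here is a relational sketch of the two repositories and the link between them. Enterworks repositories are not SQL tables, so the DDL and column types below are hypothetical; only the key fields follow the description above.

    -- Master records
    create table Customer (
      master_id    varchar(40) primary key,
      name         varchar(255),
      postal_code  varchar(20)
    );

    -- One row per external source record: (source_name, source_id) points
    -- to the record in the external system, and master_id links each source
    -- record to exactly one master record (many-to-one after merging).
    create table Customer_Source (
      source_name  varchar(40),
      source_id    varchar(40),
      master_id    varchar(40) references Customer (master_id),
      name         varchar(255),
      postal_code  varchar(20),
      primary key (source_name, source_id)
    );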

Note that we suggest putting all external sources into a single Customer_Source repository. This implies that data unification should be done upstream, before loading into the Customer_Source repository. Also, Customer_Source and Customer should have the same (or as close as possible) set of columns, so that merging a record is simply picking the right values from the duplicate records (vs. configuring complex data transformations in the merging rules). We recommend such preliminary data unification because it does not make sense to push it into the merging step when it can be done beforehand.

Usually match-and-merge is run over the master repository (vs. the source repository). Although it is possible to find cases where running it over the source repository would increase the match rate, such a gain is not significant, yet it complicates the implementation.

Incremental Data Updates and Match-and-Merge

In the case of an incremental update, you periodically receive a set of records that were updated in the external sources and apply those changes to your copy of the source data. The updates can lead to new potential duplicates, so you need to re-run match-and-merge. If your data is of a moderate size, it is fine to run match-and-merge over the whole data set without taking into account which records were updated; the advantage of this solution is simplicity.

If the performance is not acceptable, you can optimize by running match-and-merge only on the affected subset of all records: the updated records plus the records that can potentially match them. It is implemented as follows. As the result of updating your data, you know which records were updated, and you can copy these records into a separate table (e.g. customer_updated). Let us assume that the whole dataset is stored in customer. Run the MatchOneSource operation with the getPotentialMatches command, passing customer (as table1) and customer_updated (as table2) as parameters. The result will contain the subset of customer over which you should run match-and-merge. For more information, read about the getPotentialMatches command in Executing Matching Specification.
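
Conceptually, the result of getPotentialMatches contains the updated records plus any records that could match them through the equal columns of the matching rules. The SQL below is only a rough illustration of that idea with hypothetical column names; in practice, use the getPotentialMatches command itself rather than hand-written SQL.

    -- Customer records that share an equal-column value with an updated
    -- record (the updated records themselves are included trivially).
    select c.*
    from   customer c
    where  exists (
             select 1
             from   customer_updated u
             where  u.BillingPostalCode_st = c.BillingPostalCode_st
           );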
