Matching specification
Matching specifications are located in the Matching subsection of the Specifications section in the navigation tree. To create a matching specification:
- Right click on the Matching subsection of the Specifications section of the navigation tree and select Create match specification.
- In the pop-up window, enter a match specification name and select a data source that contains the data you want to match.
A matching specification consists of:
- Match table - the name of the input table that contains the duplicate records to match. It will be updated with the result of the matching; the columns that are updated are listed below.
- Consolidated table - the name of the result table created by the system that contains consolidated records compiled from matched records.
- One or more match rules that specify how to find duplicates.
- Configuration for manual editing - the names of the tables created by the system to store the results of manual editing. When you create a new match specification, the table names are generated automatically.
- Consolidation rules that specify how to compile a consolidated (aka merged) record from a set of duplicate records.
It is a requirement that the match table have the following column:
data_id nvarchar(900) - must be the primary key: a unique identifier that cannot be null.
At the beginning of matching specification execution, InfoLink checks whether the match table has the following columns. Missing columns are added to the match table:
- set_id nvarchar(900) - the result of matching is stored in this column: records with the same set_id are duplicates.
- prev_set_id nvarchar(900) - used for debugging. It stores the result of the previous match run so that you can see which duplicates were identified by the current match rule.
- match_specs nvarchar(4000) - contains a list of the names of the matching runs that affected this record.
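For illustration, a minimal match table for a SQL Server (MSSQLServer) source could be created as sketched below; the table name and payload columns are hypothetical. The set_id, prev_set_id, and match_specs columns are deliberately not created here, because InfoLink adds them automatically if they are missing:
create table dbo.Customer_Match (
    data_id nvarchar(900) not null primary key, -- required unique, non-null identifier
    [Name]  nvarchar(200),                      -- example payload column
    [City]  nvarchar(100)                       -- example payload column
);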
A matching specification is executed as follows:
- Match rules are executed in the order specified in the matching specification. Each match rule takes the match table as input and updates the set_id, prev_set_id, and match_specs columns.
- Manual merging and unmerging are applied.
- Consolidation rules are executed. Consolidation takes the match table as input and creates the consolidated table.
- Manual edits of consolidated records are applied.
You can review and edit the result of the previous steps in the Steward User Interface (Steward UI) described below.
Match run specification
A matching specification can contain one or more match rules, which are executed in the order specified. Match rules specify how to find duplicates. Each match rule specification consists of:
- Name of the match rule that is used for diagnostics and will be appended to the match_specs column of the match table.
- Match columns that are used to identify duplicates.
- Profile groups link opens a separate InfoLink window that shows the size of the groups formed by all “equal” columns. The query below is executed; count(distinct set_id) is used instead of count(*) to account for records that have already been matched by other rules, so that comparisons between such records are not computed during this rule's execution. The groups profile helps you identify large groups that can lead to long execution of fuzzy matching comparisons (a filled-in example of this query appears after this list).
select <list of all "equal" columns>, count(distinct set_id) as c
from <match table name>
where <filter>
group by <list of all "equal" columns>
order by c desc
- Filter is an SQL boolean expression used to apply this matching rule only to a subset of records.
- Type of the match rule. It can be Exact, “Fuzzy, per column limit”, or “Fuzzy, combined single limit”.
- Requires manual review checkbox allows you to mark this rule for manual review (aka a manual matching rule). Unchecked rules are considered auto matching rules. You can then execute only manual rules, only auto rules, or all rules. The most common use case is to execute all auto rules, commit the result to the source without review, then execute all manual rules, review the results via the Steward UI, and commit the reviewed results of the manual rule execution.
- Disable checkbox excludes the rule from all kinds of “execute all” commands.
- Skip groups larger than - all groups larger than the specified number are skipped (i.e. fuzzy matching comparisons are not executed for such groups). Groups are defined by grouping by all “equal” columns. Leave it blank if you want all groups to be processed. This option is designed to keep the execution time of fuzzy matching rules under control (execution time can explode in the presence of even one large group) and has no effect on rules without fuzzy matching.
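As a filled-in illustration of the profile groups query above, assuming a rule whose “equal” columns are Name and City on the hypothetical Customer_Match table from the earlier sketch, with an optional filter:
select [Name], [City], count(distinct set_id) as c
from dbo.Customer_Match
where [City] is not null
group by [Name], [City]
order by c desc;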
Exact type of match rule
In the Exact type you just list the match column names. Records that have equal values in all the match columns are considered duplicates. The Exact type of match rule is simple, yet very powerful, because you can use columns that have been standardized beforehand. You can achieve great results using the Exact type on standardized columns.
“Fuzzy, per column limit” type of match run
This type of matching allows you to match records by a combination of columns, specifying a match cutoff for each column separately.
In this type of matching you can use various comparison functions (specified in the Comparison column in the list of Match columns) for comparing column values. For a list of comparison functions, see the Comparison Functions section.
The score returned by the comparison function is compared with the Match cutoff value specified for the column. If the score is greater than or equal to the match cutoff value, the column is a match. For two records to be a match, all the match columns must match.
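For example (hypothetical columns and cutoffs): suppose a rule has two match columns, Name compared with edit_distance and a Match cutoff of 85, and City compared with equal and a Match cutoff of 100. Two records are a match only if their Name comparison scores at least 85 and their City values are identical; a Name score of 90 combined with different City values is not a match.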
“Fuzzy, combined single limit” type of match run
This type of matching allows you to match records by a combination of columns without specifying the exact set of columns that must be matched. It is implemented by comparing all pairs of records, computing a composite score for each pair of records, and comparing the composite score to a cutoff value.
In this type of matching you can use various comparison functions (specified in the Comparison column in the list of Match columns) for comparing column values. For a list of comparison functions, see the Comparison Functions section.
The composite score of comparing two records is computed as the sum of the column comparison scores for each column in the list of Match columns, divided by the number of Match columns. The result is a percentage from 0 to 100.
Match cutoff is specified as a percentage (from 0 to 100) and is compared with the composite score of each pair of records: two records are a match if the composite score is greater than or equal to the match cutoff value.
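For example (hypothetical scores): with two match columns whose comparisons score 88 and 50 for a given pair of records, the composite score is (88 + 50) / 2 = 69, so the pair is a match under a cutoff of 65 but not under a cutoff of 75.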
Performance optimization tip
For tables of any reasonable size, it is infeasible to compare all record pairs, because the number of possible pairs grows quadratically with the number of records. To avoid all-pairs comparison, the optimizer internally implements a blocking technique. Blocking partitions records into mutually exclusive and exhaustive subsets, and the matching process searches for matches only within a subset. Blocking is implemented by selecting blocking columns (i.e. match columns whose comparison function is equal) and grouping records by the blocking columns beforehand. Thus the practical recommendation is to have at least one blocking column (the more the better) in each match rule.
To select the match cutoff value, review the weight histogram table, which is created as a result of running the match run in the determine cutoffs mode. The weight histogram table contains two columns: weight - the composite score, and c - the number of pairs that have this composite score.
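A minimal sketch for reviewing the histogram (the weight histogram table name is a placeholder for whatever the determine cutoffs run created):
select weight, c
from <weight histogram table>
order by weight desc;
Typically, c grows sharply once the weight drops below the range of genuine duplicates; that transition region is a reasonable place to start when choosing the cutoff.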
Pre-built comparison functions
InfoLink supports the following pre-built comparison functions that can be used in match rules:
- equal - exact equality. The result of the equal comparison is computed as a percentage: 0 (not equal) or 100 (equal).
- edit_distance - a way of quantifying how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. The result is computed as a percentage: from 0 (not equal) to 100 (exactly equal). Precisely speaking, the following formula is used to compute the result (a worked example appears after this list):
edit_distance(s1, s2) = (1 - levenshtein_edit_distance(s1, s2) / max(length(s1), length(s2))) * 100
where levenshtein_edit_distance is the number of insertion, deletion, and substitution edits required to transform one string into the other.
- compare_words - splits the compared strings into words (using space as a delimiter) and counts the number of common words. The result is computed as a percentage: from 0 (no common words) to 100 (the same set of words). Precisely speaking, the following formula is used to compute the result (here length(s) denotes the number of words in s):
compare_words(s1, s2) = 100 * number_of_common_words(s1, s2) / ((length(s1) + length(s2)) / 2)
- compare_stand_words - works like compare_words, but the words are standardized using the general string standardization function (i.e. all blank and special characters are removed and all letters are lowercased).
- contains - implements a string contains check. The result of the contains operation is computed as a percentage: 0 (does not contain) or 100 (contains).
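As a worked illustration of the formulas above, tied to the SQL examples below: edit_distance('KPMG LLP', 'KPMG LL') = (1 - 1/8) * 100 = 87.5, reported as 88; compare_words('point one', 'point two') = 100 * 1 / ((2 + 2) / 2) = 50.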
To experiment with the comparison functions, you can connect to the database where the match and merge library is uploaded and try the functions. For the SQL Server based version (MSSQLServer source of InfoLink), use the following functions:
-- returns 100
select [dbo].[il_equal]('KPMG LLP', 'KPMG LLP') as c;
-- returns 88
select [dbo].[il_edit_distance]('KPMG LLP', 'KPMG LL') as c;
-- returns 50
select [dbo].[il_compare_words]('point one','point two');
-- returns 50
select [dbo].[il_compare_stand_words]('point one','Point$% two');
-- returns 100
select [dbo].[il_contains]('point one','one');
User-defined comparison functions
You can use your own SQL function as a comparison function. Hint: use the RunSourceScript operation to load your custom function. Such a function must take two nvarchar strings and return an int. The returned int must be from 0 to 100, so you can implement a custom fuzzy matching function. For example:
IF OBJECT_ID('dbo.my_comparison_function') IS NOT NULL
DROP FUNCTION dbo.my_comparison_function
go
CREATE FUNCTION [dbo].[my_comparison_function](@v1 nvarchar(4000) , @v2 nvarchar(4000))
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
DECLARE @res int = 0 -- return variable
if (@v1 is null or @v2 is null)
set @res = 0;
else if (@v1 = @v2)
set @res = 100;
else
set @res = 0;
RETURN @res;
END;
go
To use your custom comparison function, enter its name into the Comparison column of a matching rule. It is better to use a fully qualified name, for example: [dbo].[my_comparison_function].
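A quick check of the function above (assuming it has been created in the same database):
-- returns 100
select [dbo].[my_comparison_function](N'KPMG LLP', N'KPMG LLP') as c;
-- returns 0
select [dbo].[my_comparison_function](N'KPMG LLP', N'KPMG LL') as c;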
Manual matching
In the Manual matching section of the matching specification you specify the names of two tables that will be used to store the results of manual editing in the Steward User Interface. These tables are created and managed by the system, so you do not need to understand their structure.
Steward UI
In the Steward UI section of the matching specification you provide a list of summary columns. If summary columns are specified, the Steward UI shows only them by default. You can uncheck the Show summary columns only checkbox in the Steward UI to see all columns.
Consolidation rules
Consolidation rules define how to compile a consolidated (aka merged) record from a set of matched duplicates. Consolidation takes the match table as input and creates a new consolidated table with the name defined in the matching specification (i.e. the Consolidated table field). A set of matched records consists of the records that have the same set_id in the match table.
Consolidation rules consist of a list of columns of the consolidated table. These columns are called Survival columns. For each column you specify a strategy for computing the value of the column (a Survivorship Function). You can select from the following options:
- max - select the maximum value among all values from the matched records;
- min - select the minimum value among all values from the matched records;
- most_frequent - select the most frequent value among all values from the matched records;
- You can also select the value by defining a priority of the records. Record priority is defined by sorting the records by one or several columns; the value of the column is taken from the first record in the sort order. You can also use a priority specified in an additional table, as described below.
Use the following JSON format to specify record priority:
{"filter":<SQL expr>,"priorityTable":<priority table name>,"priorityTableReferenceColumn":<column name that refers to the priority table>,"priorityColumns":<SQL expression with sorting columns>,"aggregate":<aggregate function name>}The records are first filtered with the filter, then joined with the priorityTable by the column priorityTableReferenceColumn of the input table and ‘value’ column of the priority table, then the aggregate function is applied to each subgroup of records that have the same value of the sorting columns. Everywhere you use a SQL expression, use ‘t.’ qualifier to refer to a column of the input table and ‘p.’ qualifier to refer to a column of the priority table. The priority table must have a column named ‘value’ that will be used in the join and a column with priority that will be used for sorting in the priorityColumns expression.
Below is a couple of examples that specify value selection for Fax column. The first example prioritizes records using a priority table named ‘Source_Priority’ and preferring recently updated records among records from the same source. In the second example we additionaly require that the selected value must not be null.
{"filter":null,"priorityTable":"Source_Priority","priorityTableReferenceColumn":"Source_System_Name","priorityColumns":"-p.priority desc, t.[Last_Updated] desc","aggregate":"max"}{"filter":"t.[Fax] is not null","priorityTable":"Source_Priority","priorityTableReferenceColumn":"Source_System_Name","priorityColumns":"-p.priority desc, t.[Last_Updated] desc","aggregate":"max"} -
- concat_distinct - concatenates all values, removing duplicates first. General form: concat_distinct,<delimiter>,<order direction>. <delimiter> can be any string except a comma; if you want to use a comma, type '@comma;'. <order direction> can be asc or desc. Examples:
concat_distinct,'|',asc
concat_distinct,'@comma;',asc
- concat_distinct_mulivalue - works like concat_distinct but treats each value as the result of concatenations using the delimiter. It first splits each value into a collection, concatenates all the collections into one, removes duplicates, sorts the collection according to the order direction, and concatenates all values in the collection into a string using the delimiter. For example, the values 'a|b' and 'b|c' would be combined into 'a|b|c'. Examples:
concat_distinct_mulivalue,'|',asc
concat_distinct_mulivalue,'@comma;',asc
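As referenced above, a minimal sketch of a priority table compatible with the Fax examples (SQL Server syntax; the data values are hypothetical). The value column is joined to the Source_System_Name column of the match table, and priority is the column referenced as p.priority in the priorityColumns expression:
create table dbo.Source_Priority (
    value    nvarchar(200) not null, -- joined to t.[Source_System_Name]
    priority int           not null  -- referenced as p.priority when sorting
);
insert into dbo.Source_Priority (value, priority) values
    (N'CRM', 1),
    (N'ERP', 2);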