Cleansing
InfoLink supports various mechanisms for data cleansing:
- Parsing
- Standardization
- Rule-based
- Via data providers
Parsing
To parse strings and extract values from them, use RegexMatcher operation.
Standardization
- To implement rule-based standardization, use CheckStandardizeColumns operation. CheckStandardizeColumns operation is an advanced version of Standardization specification, which is deprecated and described below only for backward compatibility.
- To implement standardization via data providers, use connectors such as Dun & Bradstreet, Melissa Data, SAP Data Quality Service.
Standardization specification (depricated)
NOTE: Standardization specification is replaced by CheckStandardizeColumns, which implements the same idea but provides much more flexibility. Below we describe Standardization specification for backward compatibility.
Standardization takes a table and adds columns that contain standardized (normalized) values of the table columns. Standardization is implemented by creating a standardization specification and then executing the specification by right clicking on it or using the GeneralStandardize operation in a scenario.
Standardization specifications are located in the Standardization subsection of the Specifications section in the navigation tree. You can create one by right clicking on the Standardization subsection.
Standardization specification consists the following elements:
-
Source: source name of the input table.
-
Space: space name of the input table.
-
Table: name of the input table.
-
Key Column: name of a key column (must be unique and not null) in the input table. This parameter is only required if you use java-based functions in the columns section.
-
Target Table: name of the result table.
-
Columns table contains a list of the result columns.
Defining result columns
Each record in the Columns table represent a column or a group of columns of the result table. The Result column name column defines the name of the column. The Pre expression, Function, Expression columns are used to compute the value of the result column.
Both Result column name and Expression can refer by
${col_name} to a column of the input table specified in the Input
column column.
The Input column column can have the following values:
-
Select
<any column>to copy all columns from the input table. -
<expression>used when expressions do not need to refer to a column from the input table. -
name of a column in the input table.
Standardization allows you to combine the power of the source query language (usually it is SQL), in which Pre expression and Expression are specified, with the Java functions that you specify in the Function column. When Function is specified for at least one column then a round trip of data from the source to InfoLink instance and back is required. So there is a perfomance pernalty for using functions that still can be negligible in practice. For your understanding, below we describe how standardization works in presence of any Function and without them.
When Function is used at least for one column, the value of the
result column is computed as follows. First Pre expression for all
columns are computed in the source database as a single query. You can
refer to the input column from Pre expression using a string
literal ${col_name}. InfoLink iterates over the result of the
query and applies the corresponding Function to each columns. The
result of the function application is stored to the source as temporaty
table. Then another query is evaluated to evaluate Expression for
each column. You can refer to the result of function application from
Expression using a string literal ${col_name}.
When Function is not used for any column, the whole operation is
executed as a single query in the source. First Pre expression is
evaluated then Expression is evaluated for each column. You can
refer to the input column from Pre expression using a string
litteral ${col\_name}. You can refer to the result of Pre
expression from Expression using a string literal
${col\_name}.
The following Functions are supported:
StanString.java- removes blank (such as space and tab) and special (defined by the following Java regex[ \\/\\\\:;\\-\\(\\)\\&\\.\\%\\@\\^\\$\\!\\*\\=\\_\\,\\?\\<\\>\\\#\[\]\~]+` ) from the input string and converts it to lower case.StanStringPreserveLength.java- the same asStandString.javabut replaces the characters with_(instead of removing them) thus preserving the length of the input string.