Cleansing

InfoLink supports various mechanisms for data cleansing:

Parsing
Standardization
- Rule-based
- Via data providers

Parsing

To parse strings and extract values from them, use RegexMatcher operation.

Standardization

To implement rule-based standardization, use CheckStandardizeColumns operation. CheckStandardizeColumns operation is an advanced version of Standardization specification, which is deprecated and described below only for backward compatibility.
To implement standardization via data providers, use connectors such as Dun & Bradstreet, Melissa Data, SAP Data Quality Service.

Standardization specification (depricated)

NOTE: Standardization specification is replaced by CheckStandardizeColumns, which implements the same idea but provides much more flexibility. Below we describe Standardization specification for backward compatibility.

Standardization takes a table and adds columns that contain standardized (normalized) values of the table columns. Standardization is implemented by creating a standardization specification and then executing the specification by right clicking on it or using the GeneralStandardize operation in a scenario.

Standardization specifications are located in the Standardization subsection of the Specifications section in the navigation tree. You can create one by right clicking on the Standardization subsection.

Standardization specification consists the following elements:

Source: source name of the input table.
Space: space name of the input table.
Table: name of the input table.
Key Column: name of a key column (must be unique and not null) in the input table. This parameter is only required if you use java-based functions in the columns section.
Target Table: name of the result table.
Columns table contains a list of the result columns.

Defining result columns

Each record in the Columns table represent a column or a group of columns of the result table. The Result column name column defines the name of the column. The Pre expression, Function, Expression columns are used to compute the value of the result column.

Both Result column name and Expression can refer by ${col_name} to a column of the input table specified in the Input column column.

The Input column column can have the following values:

Select <any column> to copy all columns from the input table.
<expression> used when expressions do not need to refer to a column from the input table.
name of a column in the input table.

Standardization allows you to combine the power of the source query language (usually it is SQL), in which Pre expression and Expression are specified, with the Java functions that you specify in the Function column. When Function is specified for at least one column then a round trip of data from the source to InfoLink instance and back is required. So there is a perfomance pernalty for using functions that still can be negligible in practice. For your understanding, below we describe how standardization works in presence of any Function and without them.

When Function is used at least for one column, the value of the result column is computed as follows. First Pre expression for all columns are computed in the source database as a single query. You can refer to the input column from Pre expression using a string literal ${col_name}. InfoLink iterates over the result of the query and applies the corresponding Function to each columns. The result of the function application is stored to the source as temporaty table. Then another query is evaluated to evaluate Expression for each column. You can refer to the result of function application from Expression using a string literal ${col_name}.

When Function is not used for any column, the whole operation is executed as a single query in the source. First Pre expression is evaluated then Expression is evaluated for each column. You can refer to the input column from Pre expression using a string litteral ${col\_name}. You can refer to the result of Pre expression from Expression using a string literal ${col\_name}.

The following Functions are supported:

StanString.java - removes blank (such as space and tab) and special (defined by the following Java regex [ \\/\\\\:;\\-\$\$\\&\\.\\%\\@\\^\\$\\!\\*\\=\\_\\,\\?\\<\\>\\\#\[\]\~]+` ) from the input string and converts it to lower case.
StanStringPreserveLength.java - the same as StandString.java but replaces the characters with _ (instead of removing them) thus preserving the length of the input string.