Bulk Import De-Duplication Processes

Batch de-duplication and Registry de-duplication are separate processes that run at different times, using either the same match rule or different match rules. For illustration purposes, this diagram describes both de-duplication processes:

[Figure: bulk import de-duplication processes, described in the steps below]

  1. TCA Registry attributes are transformed for the staged schema. The attributes to include in the schema, and the transformations to apply to each attribute, are defined on the Define Attributes and Transformations page.

    The same page also defines the attribute and transformation combinations used for bulk duplicate identification. The staged schema includes B-Tree indexes only for the transformed attributes marked for bulk duplicate identification.

  2. The user specifies a match rule with the Bulk Duplicate Identification purpose for the de-duplication.

  3. When the de-duplication process starts, the acquisition and scoring transformations are applied to the attributes in the interface tables, based on the selected match rule.

  4. The transformed interface table records are mapped and loaded into the interface search tables, a set of temporary staged tables with B-Tree indexes.

  5. To find duplicates within the TCA interface tables:

    To find duplicates between the TCA interface tables and the TCA Registry:

  6. Matched acquisition attribute values determine the most relevant subset of records from the interface search tables to form the work unit.

  7. Using the scoring criteria in the match rule, each record in the work unit is compared to all other work unit records in the same staging table.

  8. A score is calculated for each record in the work unit, and the scores for all entities are added together to determine duplicate parties.

  9. The score of each work unit record is compared against the match and automatic merge thresholds defined in the match rule.
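The flow in steps 3 through 9 can be illustrated with a small, self-contained Python sketch. It is not the product's implementation: the transformations, attribute names, scores, and threshold values are hypothetical, the in-memory grouping stands in for the indexed interface search tables, and scores are summed per attribute rather than per entity. Treating a score that equals a threshold as meeting it is also an assumption.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical transformations, one per attribute (assumed for illustration).
TRANSFORMS = {
    "name": lambda v: " ".join(v.upper().split()),  # uppercase, collapse spaces
    "city": lambda v: v.strip().upper(),
    "phone": lambda v: re.sub(r"\D", "", v),        # keep digits only
}

# Hypothetical match rule: acquisition attributes, scoring weights, thresholds.
MATCH_RULE = {
    "acquisition": ["name"],
    "scoring": {"name": 50, "city": 30, "phone": 20},
    "match_threshold": 60,
    "auto_merge_threshold": 90,
}

def transform(record):
    """Apply the transformations to one interface record (steps 3 and 4)."""
    staged = {"id": record["id"]}
    for attr, fn in TRANSFORMS.items():
        staged[attr] = fn(record.get(attr, ""))
    return staged

def build_work_units(staged_records, rule):
    """Group records whose transformed acquisition values match (step 6)."""
    units = defaultdict(list)
    for rec in staged_records:
        key = tuple(rec[attr] for attr in rule["acquisition"])
        if all(key):  # skip records with an empty acquisition value
            units[key].append(rec)
    return units

def score_pair(a, b, rule):
    """Sum the scores of attributes whose transformed values agree (steps 7 and 8)."""
    return sum(points for attr, points in rule["scoring"].items()
               if a[attr] and a[attr] == b[attr])

def classify(score, rule):
    """Compare a score against the two thresholds (step 9)."""
    if score >= rule["auto_merge_threshold"]:
        return "auto_merge"
    if score >= rule["match_threshold"]:
        return "match"
    return "no_match"

def find_duplicates(records, rule):
    """Return (id_a, id_b, score, classification) for each pair in a work unit."""
    staged = [transform(r) for r in records]
    results = []
    for unit in build_work_units(staged, rule).values():
        for a, b in combinations(unit, 2):
            score = score_pair(a, b, rule)
            results.append((a["id"], b["id"], score, classify(score, rule)))
    return results

SAMPLE = [
    {"id": 1, "name": "Acme  Corp", "city": "Boston", "phone": "555-0100"},
    {"id": 2, "name": "acme corp", "city": "boston ", "phone": "(555) 0100"},
    {"id": 3, "name": "ACME CORP", "city": "Boston", "phone": "555-0199"},
    {"id": 4, "name": "Globex", "city": "Boston", "phone": "555-0100"},
]

if __name__ == "__main__":
    for row in find_duplicates(SAMPLE, MATCH_RULE):
        print(row)
```

In this sketch, record 4 is never compared with the others because its transformed name places it in a different work unit, even though its city and phone agree with record 1. That is the purpose of acquisition: only the most relevant subset of records is scored.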
