1

I am trying to determine a systematic approach for identifying the closest number of unique persons that exist between three separate systems.

To better define the premise conditions: Systems:

System 1
System 2
System 3

Data-elements shared between systems:

Person First Name
Person Last Name
National ID Number
Date of Birth

Potential Data Imperfections:

Duplicate persons may exist within each system
Data-entry errors may result in typos within each system
Data-elements may be missing from persons within each system
No validation exists between the systems
Persons may exist in multiple systems, or only single system

With this in mind, what is the best logic to determine among the three systems an accurate count of unique persons? I understand that there will be a tiering of matching results based on quality of data that exists but I'm hoping to find a logical system to calculate that tiering.

Using this SQL implementation of the Levenshtein matching formula , it's possible to compare strings and numbers and calculate a number in keystrokes that differentiate two strings. I believe this may be a useful tool to determine likeness for matching and measuring imperfections.

  • 2
    Is it possible for you to share a sample dataset, containing examples of these imperfections, and large enough to make it possible to validate a possible algorithm? – evilmandarine May 18 '22 at 13:48

1 Answers1

1

What are the different systems/datastores you are mentioning ? Since you point to a TSQL Leventhstein implementation, is it safe to assume you are talking about 3 separate DBs ?

If I were you, I would first of all look for duplicate National ID Numbers - Probably there is only 1 way to write them and to search for them. Assuming these are unique values, upon a match, you can be confident that you found a duplicate. You will still need to cast between possible datatypes e.g. varchar in one system and integer in another.

From that point and on, you can scan these matches for LastName equality, then for FirstName and Date of birth matchings, handling possible null values appropriately etc.

Regarding the Date of Birth, you need to take care of the different date formatting that each system has stored the values with and perform the appropriate castings. This can work as an extra validation as soon as you've spotted a duplicate LastName or a National ID Number.

As for First Name and Last Name, since these are string values, you can expect typos and imperfections as you said. It's a good idea to check the Levenshtein distance between them.

Commercial tools (like this: https://www.dataqualityapps.com/know-how/118-sql-server-deduping-data.html) that perform Data Deduplication, almost always use some implementation of the Levenshtein's algorithm under the hood.