I am trying to determine a systematic approach for estimating, as closely as possible, the number of unique persons that exist across three separate systems.
To better define the premises:
Systems:
System 1
System 2
System 3
Data-elements shared between systems:
Person First Name
Person Last Name
National ID Number
Date of Birth
Potential Data Imperfections:
Duplicate persons may exist within each system
Data-entry errors may result in typos within each system
Data-elements may be missing from persons within each system
No validation exists between the systems
Persons may exist in multiple systems, or in only a single system
With this in mind, what is the best logic for arriving at an accurate count of unique persons across the three systems? I understand that matches will fall into tiers depending on the quality of the underlying data, but I'm hoping to find a systematic way to define and calculate those tiers.
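To illustrate the kind of tiering I have in mind, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the sample records, the field names (`first`, `last`, `nid`, `dob`), the three tier rules, and the helper names `same`, `tier`, and `count_unique`. Records from all three systems are pooled into one list, so duplicates within a system are handled the same way as matches across systems. Pairs that match at a given tier or better are merged with a simple union-find, and the number of resulting groups is the unique-person count for that tier.

```python
from itertools import combinations

# Hypothetical records pooled from all three systems; None means "missing".
records = [
    {"system": 1, "first": "John", "last": "Smith", "nid": "123456789", "dob": "1980-01-02"},
    {"system": 2, "first": "John", "last": "Smith", "nid": "123456789", "dob": None},
    {"system": 3, "first": "Jon",  "last": "Smith", "nid": None,        "dob": "1980-01-02"},
    {"system": 1, "first": "Mary", "last": "Jones", "nid": "987654321", "dob": "1975-05-05"},
]

def same(a, b, field):
    """True only when both values are present and equal, ignoring case and padding."""
    x, y = a[field], b[field]
    return x is not None and y is not None and x.strip().lower() == y.strip().lower()

def tier(a, b):
    """Return 1 (strongest) to 3 (weakest), or None for no match. Assumed rules."""
    if same(a, b, "nid") and (same(a, b, "dob") or same(a, b, "last")):
        return 1  # ID agrees and is corroborated by a second field
    if same(a, b, "nid"):
        return 2  # ID agrees, nothing else confirms it
    if same(a, b, "dob") and same(a, b, "last"):
        return 3  # no usable ID, but date of birth and last name agree
    return None

def count_unique(records, max_tier):
    """Count groups after merging every pair that matches at max_tier or better."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        t = tier(records[i], records[j])
        if t is not None and t <= max_tier:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(records))})
```

Calling `count_unique(records, 1)` and `count_unique(records, 3)` on the same data gives a conservative and a permissive count respectively, which is roughly the range I would expect the true number to fall in. Comparing every pair is quadratic, so for real data volumes some form of pre-grouping (for example by date of birth) would probably be needed.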
Using this SQL implementation of the Levenshtein distance, it's possible to compare two strings (including numbers stored as strings) and calculate the number of single-character edits, in effect keystrokes, that separate them. I believe this may be a useful tool for measuring likeness when matching and for quantifying imperfections.
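For reference, this is my understanding of what the distance computes, written as a short Python sketch rather than the SQL version. The `is_close` helper and its `max_edits` threshold are hypothetical and only show how the distance might be turned into a yes/no "probably a typo" decision.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def is_close(a, b, max_edits=1):
    """Hypothetical helper: values within max_edits edits count as a likely typo."""
    if a is None or b is None:
        return False  # a missing value can never confirm a match
    return levenshtein(a.strip().lower(), b.strip().lower()) <= max_edits
```

For example, `levenshtein("Jon", "John")` is 1, while two transposed digits in an ID number count as 2 edits, so the threshold may need to differ per field.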