Scenario
I have a number of enterprise datasets that I must find missing links between, and one of the ways I use for finding potential matches is joining on first and last name. The complication is that we have a significant number of people who use their legal name in one dataset (employee records), but they use either a nickname or (worse yet) their middle name in others (i.e., EAD, training, PIV card, etc.). I am looking for a way to match up these potentially disparate names across the various datasets.
Simplified Example
Here is an overly simplified example of what I am trying to do, but I think it conveys my thought process. I begin with the employee table:
Employees table
employee_id | first_name | last_name |
---|---|---|
052451 | Robert | Armsden |
442896 | Jacob | Craxford |
054149 | Grant | Keeting |
025747 | Gabrielle | Renton |
071238 | Margaret | Seifenmacher |
and try to find the matching data from the PIV card dataset:
Cards table
card_id | first_name | last_name |
---|---|---|
1008571527 | Bobbie | Armsden |
1009599982 | Jake | Craxford |
1004786477 | Gabi | Renton |
1000628540 | Maggy | Seifenmacher |
Desired Result
After trying to match these datasets on first name and last name, I would like to end up with the following:
Employees_Cards table
emp_employee_id | emp_first_name | emp_last_name | crd_card_id | crd_first_name | crd_last_name |
---|---|---|---|---|---|
052451 | Robert | Armsden | 1008571527 | Bobbie | Armsden |
442896 | Jacob | Craxford | 1009599982 | Jake | Craxford |
054149 | Grant | Keeting | NULL | NULL | NULL |
025747 | Gabrielle | Renton | 1004786477 | Gabi | Renton |
071238 | Margaret | Seifenmacher | 1000628540 | Maggy | Seifenmacher |
As you can see, I would like to make the following matches:
Gabrielle -> Gabi
Jacob -> Jacob
Margaret -> Maggy
Robert -> Bobbie
My initial thought was to find a common names dataset along the lines of:
Name_Aliases table
name1 | name2 | name3 | name4 |
---|---|---|---|
Gabrielle | Gabi | NULL | NULL |
Jacob | Jake | NULL | NULL |
Margaret | Maggy | Maggie | Meg |
Michael | Mike | Mikey | Mick |
Robert | Bobbie | Bob | Rob |
and use something like this for the JOIN:
CREATE TABLE employee_cards AS
SELECT
employees.employee_id AS emp_employee_ID,
employees.first_name AS emp_first_name,
employees.last_name AS emp_last_name,
cards.card_id AS crd_card_id,
cards.first_name AS crd_first_name,
cards.last_name AS crd_last_name
FROM employees
LEFT OUTER JOIN name_aliases
LEFT OUTER JOIN cards
ON employees.first_name IN (
nane_aliases.name1,
nane_aliases.name2,
nane_aliases.name3,
nane_aliases.name4
)
AND employees.last_name = cards.last_name;
This is where I got stuck, since I could not figure out how to tie the result of the first ON condition to the first names in the cards table.
Some Questions
As I ponder more deeply into this problem, I know that I am not the first person who has encountered this need for matching on common name variants. My initial search pointed me to things like fuzzysearch
and soundex
, but those are not quite what I need for the present scenario (although they may come in handy down the road). Given this, I have several questions for the community:
Downloadable Common Names Variants Dataset?
Has someone compiled or crowd-sourced a comprehensive names variants dataset along the lines of my name-aliases
table above? My search landed me on a couple of sites that appear to have this kind of data, but neither one of them is downloadable for import into my local database.
I did find this SO discussion over a decade old, but it did not appear to be current: Database of common name aliases / nicknames of people
Also, I am unable to pay any money for this, so I hope there might be one hiding on github.
Better Way to Structure the Name_Aliases table?
Since each record in the name_aliases table can have two or more entries, is there a better way to set up the structure for this table, to make it infinitely flexible?
Match Any Column in Name_Aliases Table?
How do I set up the JOIN query to match up employees.first_name with any column in name_alises, and then finally match that with cards.first_name?
Better Solution?
Am I taking the wrong approach to solve this problem? Has someone come up with a more flexible and elegant approach to this using PostgreSQL?