I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. About half the records have an identifying code that maps records to individuals; for the half that don't, I need to fuzzy match names to flag records that potentially belong to the same person.
From looking at the records with the identifying code, I've created a list of differences that have occurred in the recording of names for the same individual:
- Inclusion of middle name e.g. Jon Snow vs Jon Targaryen Snow
- Inclusion of a second last name e.g. Jon Snow vs Jon Targaryen-Snow
- Nickname / shortening of first name e.g. Jonathon Snow vs Jon Snow
- Reversal of names e.g. Jon Snow vs Snow Jon
- Misspellings/typos/variants e.g. Samual/Samuel, Monica/Monika, Rafael/Raphael
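
To make these cases concrete, here is a minimal sketch that scores one example pair per category with base R's `adist()` (generalized Levenshtein distance, from the `utils` package). The `pairs` data frame and its column names are made up for illustration; only the example names come from the list above.

```r
# One example pair per variation type, taken from the list above.
# The data frame and its column names are illustrative assumptions.
pairs <- data.frame(
  type = c("middle name", "second last name", "nickname", "reversal", "typo"),
  a    = c("Jon Snow", "Jon Snow", "Jonathon Snow", "Jon Snow", "Samual"),
  b    = c("Jon Targaryen Snow", "Jon Targaryen-Snow", "Jon Snow", "Snow Jon", "Samuel"),
  stringsAsFactors = FALSE
)

# adist() returns a distance matrix, so compare row by row and take the
# single value for each pair.
pairs$dist <- mapply(
  function(x, y) as.integer(adist(x, y)),
  pairs$a, pairs$b,
  USE.NAMES = FALSE
)

# Distance relative to the longer of the two strings.
pairs$rel <- pairs$dist / pmax(nchar(pairs$a), nchar(pairs$b))

print(pairs[order(pairs$dist), ])
```

The middle-name and second-last-name pairs come out at a distance of 10 and the nickname pair at 5, while the typo pair is only 1, so a single character-level threshold has to be very loose to cover all of the categories.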
Given the types of matches I'm after, is there a better approach than agrep()/Levenshtein distance that is easily implemented in R?
Edit: agrep() in R doesn't do a very good job with this problem. Because of the large number of insertions and substitutions I need to allow to account for the different ways names are recorded, it throws up a lot of false matches.
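
One direction that might reduce the false matches is to compare names token by token rather than as whole strings, so that word order and extra tokens stop counting as edits. The following is a rough sketch under that assumption, using base R only. `name_tokens()` and `tokens_match()` are hypothetical helpers, and the cutoff `max_dist = 1` is an arbitrary illustrative value, not something tuned on the data.

```r
# Hypothetical helper: lower-case a name, split it on whitespace and
# hyphens, and return the non-empty tokens in sorted order.
name_tokens <- function(x) {
  x <- tolower(trimws(x))
  tokens <- strsplit(x, "[[:space:]-]+")[[1]]
  sort(tokens[nzchar(tokens)])
}

# Hypothetical helper: TRUE if every token of the name with fewer tokens
# has a counterpart within max_dist edits in the other name.
# max_dist = 1 is an arbitrary cutoff chosen for illustration.
tokens_match <- function(a, b, max_dist = 1) {
  ta <- name_tokens(a)
  tb <- name_tokens(b)
  if (length(ta) > length(tb)) {
    tmp <- ta; ta <- tb; tb <- tmp
  }
  if (length(ta) == 0) return(FALSE)
  d <- adist(ta, tb)  # token-by-token edit distance matrix
  all(apply(d, 1, min) <= max_dist)
}

tokens_match("Jon Snow", "Snow Jon")            # reversal
tokens_match("Jon Snow", "Jon Targaryen-Snow")  # second last name
tokens_match("Samual Tarly", "Samuel Tarly")    # typo
tokens_match("Jonathon Snow", "Jon Snow")       # nickname
tokens_match("Jon Snow", "Jan Stone")           # different person
```

With these settings the reversal, second-last-name and typo pairs match, and "Jon Snow" vs "Jan Stone" does not. The nickname pair is not caught either, so it would presumably need a separate nickname lookup. The sketch also lets two tokens match the same counterpart and treats a single-token name as matching anything that contains it, and it says nothing about how to avoid comparing every pair among roughly 500,000 uncoded records.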