
I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. About half the records have an identifying code that maps records to individuals; for the half that don't, I need to fuzzy match names to flag records that potentially belong to the same person.

From looking at the records with the identifying code, I've created a list of differences that have occurred in the recording of names for the same individual:

  • Inclusion of middle name e.g. Jon Snow vs Jon Targaryen Snow
  • Inclusion of a second last name e.g. Jon Snow vs Jon Targaryen-Snow
  • Nickname / shortening of first name e.g. Jonathon Snow vs Jon Snow
  • Reversal of names e.g. Jon Snow vs Snow Jon
  • Misspellings/typos/variants e.g. Samual/Samuel, Monica/Monika, Rafael/Raphael

Given the types of matches I'm after, is there a better approach than agrep()/Levenshtein distance that is easily implemented in R?

Edit: agrep() in R doesn't do a very good job with this problem - because of the large number of insertions and substitutions I need to allow to account for the ways names are recorded differently, a lot of false matches are thrown up.
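
To make the problem in the edit concrete, here is a small base-R illustration using `utils::adist()`, which computes plain Levenshtein distance. The names are made up for the example. Covering an inserted middle name needs so many edits that a threshold that high also accepts unrelated names of similar length.

```r
# Edits needed to get from a name to the same name with a middle name added
d_same <- utils::adist("Jon Snow", "Jon Targaryen Snow")[1, 1]

# Edits needed to get from that name to an unrelated one of similar length
d_diff <- utils::adist("Jon Snow", "Ned Stark")[1, 1]

d_same            # 10: the ten characters of "Targaryen " are inserted
d_diff <= d_same  # TRUE: a threshold of 10 would also accept this pair
```
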

edstatsuser
  • Possible duplicate of [Create a unique ID by fuzzy matching of names (via agrep using R)](https://stackoverflow.com/questions/12999772/create-a-unique-id-by-fuzzy-matching-of-names-via-agrep-using-r) – Imran Ali Jul 28 '17 at 03:00
  • @ImranAli that question is about efficiency, I'm asking about the best approach to the problem given the particular differences in the way names are recorded in my dataset. – edstatsuser Jul 28 '17 at 03:06

2 Answers


I would make multiple passes.

"Jon .* Snow" - Middle name

"Jon .*Snow" - Second last name

Nicknames will require a dictionary of mappings from long form to short; there's no regular expression that'll handle this.

"Snow Jon" - Reversal (duh)

agrep will handle minor misspellings.

You probably also want to tokenise your names into first, middle and last names.
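
One way the ideas above might be combined, as a minimal base-R sketch rather than a tested solution. It assumes names are plain "first [middle] last" strings. The `nicknames` lookup and the helpers `tokenise_name()` and `same_person()` are hypothetical names made up for this example, and the distance threshold of 1 is an arbitrary choice.

```r
# Hypothetical nickname dictionary, short form -> long form.
# A real one would need to be far larger.
nicknames <- c(jon = "jonathon", sam = "samuel")

# Split a name into lower-case tokens. Hyphens count as separators, so
# "Targaryen-Snow" yields both last names.
tokenise_name <- function(x) {
  tokens <- strsplit(tolower(trimws(x)), "[[:space:]-]+")[[1]]
  is_nick <- tokens %in% names(nicknames)
  tokens[is_nick] <- nicknames[tokens[is_nick]]  # expand known nicknames
  unname(tokens)
}

# Flag two names as a candidate match if every token of the shorter name has
# a close counterpart among the tokens of the longer name, in any order.
same_person <- function(a, b, max_dist = 1) {
  ta <- tokenise_name(a)
  tb <- tokenise_name(b)
  if (length(ta) > length(tb)) {
    tmp <- ta; ta <- tb; tb <- tmp
  }
  if (length(ta) < 2) return(FALSE)  # require at least a first and last name
  d <- utils::adist(ta, tb)          # Levenshtein distance between tokens
  all(apply(d, 1, min) <= max_dist)
}

same_person("Jon Snow", "Jon Targaryen Snow")  # extra middle name
same_person("Jon Snow", "Jon Targaryen-Snow")  # second last name
same_person("Jonathon Snow", "Jon Snow")       # nickname
same_person("Jon Snow", "Snow Jon")            # reversal
same_person("Samual Tarly", "Samuel Tarly")    # one-letter typo
same_person("Jon Snow", "Arya Stark")          # unrelated names
```

Comparing token by token keeps the allowed edit distance small, since a missing middle name no longer counts as ten insertions. The sketch only flags candidates: with `max_dist = 1` it would miss Rafael/Raphael, which are two edits apart, and it does not stop two tokens of one name matching the same token of the other.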

shians
  • Thanks, I didn't consider making multiple passes and I thought it would be easier to keep a name as a single string rather than split it. Just to check - when you say 'reversal' you simply mean check for last name in first, and vice versa? – edstatsuser Jul 31 '17 at 22:06
  • Yes, that's what I mean by reversal. It's probably easier to do if you split things up into separate columns for first, middle and last name. – shians Aug 01 '17 at 01:55

The synthesisr package (https://cran.r-project.org/web/packages/synthesisr/index.html) might be helpful. It uses R code to mimic some of the fuzzy matching functionality in the fuzzywuzzy Python package and fuzzywuzzyR. There are several different metrics taken from fuzzywuzzy; a lower score means greater similarity. The methods are accessible in two different ways, as shown below.

Specifically, in this case, the "token" functions might be useful since strings are tokenized by whitespace then alphabetized to deal with situations like reversals.

library(synthesisr)

# Each metric can be called directly, or through fuzzdist() via its method argument.
fuzz_m_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_m_ratio")

fuzz_partial_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_partial_ratio")

fuzz_token_sort_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_token_sort_ratio")

fuzz_token_set_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_token_set_ratio")
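
To illustrate why sorting tokens helps with reversals, here is a base-R sketch of the idea. It is not synthesisr's actual implementation, and `token_sort_key()` is a hypothetical helper made up for this example.

```r
# Hypothetical helper: lower-case a name, split it on whitespace, sort the
# tokens alphabetically and paste them back together.
token_sort_key <- function(x) {
  tokens <- strsplit(tolower(trimws(x)), "[[:space:]]+")[[1]]
  paste(sort(tokens), collapse = " ")
}

key_a <- token_sort_key("Jon Snow")
key_b <- token_sort_key("Snow Jon")
identical(key_a, key_b)  # TRUE: both become "jon snow"

# After sorting, a reversed name with a typo is still only one edit away.
typo_dist <- utils::adist(token_sort_key("Snow Samual"),
                          token_sort_key("Samuel Snow"))[1, 1]
typo_dist                # 1
```
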
cannin