I have two large data sets (500k obs each) which I would like to link via fuzzy string matching on individuals' names but utilizing information on other variables as well. The problem is similar to the one described here: How can I match fuzzy match strings from two datasets?
However, the solution posted there requires to produce all pairwise potential matches first via expand.grid
but with my data this cannot be done. Already if you have two data sets of 10,000 obs each, then this results in a total data set of 100,000,000 potential pairwise matches.
I would like to produce a many-to-one merge first wherein observation $k$ from data set A is matched with the 5 closest matching observations from data set B (as judged by the Jaro Winkler string distance), which fall in a certain age band, say plus/minus 5 years.
For instance, if $k$ in data A is
name birthyear
John Smith 1984
and the other observations in data set B are
serial name birthyear
1 John Smith 1983
2 Sara Pinkert 1973
3 John Smyth 1999
4 John Smithe 1985
5 John Smith 1984
6 Jon Smith 1984
then the five "best" matches for $k$ with observations in data B should be obs number 1, 4, 5, 6 for a birth year restriction of +-5 years. In this case no. 2 (Sara Pinkert) should not match due to the name, and no. 3 (John Smyth) should not match because the birth year for this observation is too late.
Functions and commands provided by other libraries such as fastLink
, stringdist
, or recordLinkage
are good and fast but they always only produce one-to-one matches (and they rarely have the capability of incorporating the information from the birth year band to restrict the dimension of the matching problem).
The closes solution I have been able to figure out so far is using the compare.linkage
function from recordLinkage
but the blocking option (blockfld) seems to block strictly on a particular variable so it's not obvious how a range for the birth year information could be used:
rpairs = compare.linkage(dataA,
dataB,
blockfld = c("birthyear"),
identity1 = dataA$id1,
identity2 = dataB$id2,
n_match = 5,
strcmpfun = jarowinkler)
But this only blocks on perfect birthyear, hence it would return two matches which would be obs no. 5 and 6 (John Smith 1984, Jon Smith 1984).
Here is some sample data for the matching problem. Due to the small size it seems trivial but in the whole sample with half a million obs each (some of which appear in one but not in the other data, and some appearing in both but potentially with typos in their names) it's more tricky.
name1 = c("John Smith", "Adam Bower", "Felix von Epstein", "Charles Sawyer", "Benjamin Hoynes")
yob1 = c(1980, 1977, 1981, 1981, 1978)
dataA = data.frame(name1, yob1)
name2 = c("Jon Smyth", "Perry Bower", "Felix Epstein", "Terry Barnes", "John Smith", "Benamin Hoynes", "Frank Sawyer", "Charles Sawer", "Charles Sauer", "Philip Smith", "Franklin Sawyer", "Jonathan Smith", "Gabriel Bars", "Aron Bow", "Harry Haynes")
yob2 = c(1981, 1983, 1981, 1982, 1983, 1980, 1980, 1986, 1982, 1978, 1977, 1981, 1979, 1975, 1980)
dataB = data.frame(name2, yob2)