Matching two data sets via fuzzy many-to-one string match in R

Question

I have two large data sets (500k obs each) which I would like to link via fuzzy string matching on individuals' names but utilizing information on other variables as well. The problem is similar to the one described here: How can I match fuzzy match strings from two datasets?

However, the solution posted there requires to produce all pairwise potential matches first via expand.grid but with my data this cannot be done. Already if you have two data sets of 10,000 obs each, then this results in a total data set of 100,000,000 potential pairwise matches.

I would like to produce a many-to-one merge first wherein observation $k$ from data set A is matched with the 5 closest matching observations from data set B (as judged by the Jaro Winkler string distance), which fall in a certain age band, say plus/minus 5 years.

For instance, if $k$ in data A is

name          birthyear
John Smith    1984

and the other observations in data set B are

serial   name           birthyear
1        John Smith     1983
2        Sara Pinkert   1973
3        John Smyth     1999
4        John Smithe    1985
5        John Smith     1984
6        Jon Smith      1984

then the five "best" matches for $k$ with observations in data B should be obs number 1, 4, 5, 6 for a birth year restriction of +-5 years. In this case no. 2 (Sara Pinkert) should not match due to the name, and no. 3 (John Smyth) should not match because the birth year for this observation is too late.

Functions and commands provided by other libraries such as fastLink, stringdist, or recordLinkage are good and fast but they always only produce one-to-one matches (and they rarely have the capability of incorporating the information from the birth year band to restrict the dimension of the matching problem).

The closes solution I have been able to figure out so far is using the compare.linkage function from recordLinkage but the blocking option (blockfld) seems to block strictly on a particular variable so it's not obvious how a range for the birth year information could be used:

rpairs = compare.linkage(dataA, 
                     dataB, 
                     blockfld = c("birthyear"), 
                     identity1 = dataA$id1, 
                     identity2 = dataB$id2, 
                     n_match = 5, 
                     strcmpfun = jarowinkler)

But this only blocks on perfect birthyear, hence it would return two matches which would be obs no. 5 and 6 (John Smith 1984, Jon Smith 1984).

Here is some sample data for the matching problem. Due to the small size it seems trivial but in the whole sample with half a million obs each (some of which appear in one but not in the other data, and some appearing in both but potentially with typos in their names) it's more tricky.

name1 = c("John Smith", "Adam Bower", "Felix von Epstein", "Charles Sawyer", "Benjamin Hoynes")
yob1 = c(1980, 1977, 1981, 1981, 1978)
dataA = data.frame(name1, yob1)

name2 = c("Jon Smyth", "Perry Bower", "Felix Epstein", "Terry Barnes", "John Smith", "Benamin Hoynes", "Frank Sawyer", "Charles Sawer", "Charles Sauer", "Philip Smith", "Franklin Sawyer", "Jonathan Smith", "Gabriel Bars", "Aron Bow", "Harry Haynes")
yob2 = c(1981, 1983, 1981, 1982, 1983, 1980, 1980, 1986, 1982, 1978, 1977, 1981, 1979, 1975, 1980)
dataB = data.frame(name2, yob2)

I would like to help you, but can you provide just two sample datasets for us to toy around with? — InfiniteFlash, Dec 25 '17 at 22:14
@InfiniteFlashChess thanks a lot! I posted some parts of the data frames. Please let me know if there is anything else you need. — Andy, Dec 26 '17 at 11:14
Have you seen [this algorithm](https://github.com/wolfgarbe/SymSpell)? — F. Privé, Dec 27 '17 at 22:16
@F.Privé I don't know how many unique names there are but it should be in the order of 30 to 50% of the data. I haven't seen the SymSpell algorithm though it seems that there is no R implementation. — Andy, Dec 29 '17 at 12:22

CPak · Answer 1 · 2017-12-29T16:13:30.680

Edited for additional code based on comments

Maybe this will help you

Your data

name1 = c("John Smith", "Adam Bower", "Felix von Epstein", "Charles Sawyer", "Benjamin Hoynes")
yob1 = c(1980, 1977, 1981, 1981, 1978)
dataA = data.frame(name1, yob1)

name2 = c("Jon Smyth", "Perry Bower", "Felix Epstein", "Terry Barnes", "John Smith", "Benamin Hoynes", "Frank Sawyer", "Charles Sawer", "Charles Sauer", "Philip Smith", "Franklin Sawyer", "Jonathan Smith", "Gabriel Bars", "Aron Bow", "Harry Haynes")
yob2 = c(1981, 1983, 1981, 1982, 1983, 1980, 1980, 1986, 1982, 1978, 1977, 1981, 1979, 1975, 1980)
dataB = data.frame(name2, yob2)

Function for approximate string-match and age-band filtering

top_five_amatch <- function(A_row, B) {
                require(stringdist)
                ans <- intersect(order(stringdist(A_row$name1, dataB$name2, method="jw")), which(abs(A_row$yob1 - dataB$yob2) <= 5))
                return(head(ans, 5))
            }

At its heart is

library(stringdist)
order(stringdist(dataA$name1[1], dataB$name2, method="jw"))     # order of string-distance
# [1]  5  1 12 10 14  7  8  9  6 11  3  2  4 15 13

which(abs(dataA$yob1[1] - dataB$yob2) <= 5)                     # age band filter
# [1]  1  2  3  4  5  6  7  9 10 11 12 13 14 15

intersect of the 2 will keep only values that are present after age-band filtering

Main
Grab indexes of nearest match per row of dataA

I <- lapply(seq_len(nrow(dataA)), function(i) top_five_amatch(dataA[i,], dataB))
# [[1]]
# [1]  5  1 12 10 14

# [[2]]
# [1] 14  7  1  4  6

# [[3]]
# [1]  3  1  2  6 11

# [[4]]
# [1]  8  9  7 11  2

# [[5]]
# [1]  6 15  4  2 11

Top 5 matches for each row of dataA

matchB <- dataB[unlist(I), ]
               # name2 yob2
# 5         John Smith 1983
# 1          Jon Smyth 1981
# 12    Jonathan Smith 1981
# 10      Philip Smith 1978
# 14          Aron Bow 1975
# 14.1        Aron Bow 1975
# 7       Frank Sawyer 1980
# 1.1        Jon Smyth 1981
# 4       Terry Barnes 1982
# 6     Benamin Hoynes 1980
# 3      Felix Epstein 1981
# 1.2        Jon Smyth 1981
# 2        Perry Bower 1983
# 6.1   Benamin Hoynes 1980
# 11   Franklin Sawyer 1977
# 8      Charles Sawer 1986
# 9      Charles Sauer 1982
# 7.1     Frank Sawyer 1980
# 11.1 Franklin Sawyer 1977
# 2.1      Perry Bower 1983
# 6.2   Benamin Hoynes 1980
# 15      Harry Haynes 1980
# 4.1     Terry Barnes 1982
# 2.2      Perry Bower 1983
# 11.2 Franklin Sawyer 1977

To save in 'wide' format with multiple columns, try something like

matchB <- lapply(I, function(i) dataB[i,])
Reduce("cbind", matchB)
            # name2 yob2           name2 yob2           name2 yob2
# 5      John Smith 1983    Frank Sawyer 1980   Felix Epstein 1981
# 1       Jon Smyth 1981 Franklin Sawyer 1977  Benamin Hoynes 1980
# 12 Jonathan Smith 1981        Aron Bow 1975     Perry Bower 1983
# 10   Philip Smith 1978  Benamin Hoynes 1980    Terry Barnes 1982
# 14       Aron Bow 1975    Gabriel Bars 1979 Franklin Sawyer 1977
             # name2 yob2           name2 yob2
# 5    Charles Sawer 1986  Benamin Hoynes 1980
# 1    Charles Sauer 1982 Franklin Sawyer 1977
# 12 Franklin Sawyer 1977    Harry Haynes 1980
# 10    Frank Sawyer 1980    Terry Barnes 1982
# 14    Gabriel Bars 1979   Felix Epstein 1981

Thanks a lot, this does what I was looking for. It does not seem to scale up very well though (I was trying the code on the whole sample and it is still not finished after 20 hours). But I will try to figure out a better way to split the sample and run this on sub-samples separately. Is it possible to have the matches in new columns (match1, yob1, match2, yob2, ...) rather than everything in the same column? Otherwise I'm super happy with this, thank you! — Andy, Dec 29 '17 at 10:50
As you said, there are 100M matches to perform and this solution sacrifices performance for memory. You can look into a parallel solution after splitting your data into appropriate sized chunks. Perhaps worth another post? — CPak, Dec 29 '17 at 16:14
Actually, the point of having the age range restriction was to not do all these 100M computations. But I think that I didn't make this sufficiently clear in the question, so not your fault. Your code does what I asked for. Maybe I can figure out how to speed it up a little. Thanks a lot for your help! — Andy, Dec 30 '17 at 11:00

Matching two data sets via fuzzy many-to-one string match in R

1 Answers1