I'm a bit of an R novice and have been experimenting with the agrep function in R. I have a large database of customers (1.5 million rows), many of which I'm sure are duplicates. Many of those duplicates, however, are not revealed by using table() to get the frequency of exactly repeated names. Just eyeballing some of the rows, I have noticed many duplicates that look "unique" only because of a minor mis-key in the spelling of the name.
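For reference, the exact-match check I mean is just something along these lines, using vf$name, my customer-name column (described further below):

# Exact-match check: counts identical spellings only, so a name with a
# one-character typo still shows a frequency of 1 and is missed.
name_counts <- table(vf$name)
exact_dups  <- names(name_counts[name_counts > 1])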
So far, to find the duplicates in my data set, I have been using agrep() to do the fuzzy name matching. I have been playing around with the max.distance argument in agrep() to return different sets of approximate matches, and I think I have found a happy medium between returning false positives and missing true matches. Since agrep() is limited to matching a single pattern at a time, I found an entry on Stack Overflow that helped me write an sapply call to match the data set against numerous patterns. Here is the code I am using to loop over the patterns as it combs through my data set for "duplicates":
dups4 <- data.frame(unlist(sapply(unique$name, agrep, vf$name, value = TRUE, max.distance = 0.154)))
unique$name is the unique index I developed, containing all of the "patterns" I wish to hunt for in my data set.
vf$name is the column in my data frame that contains all of my customer names.
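In case it helps, here is a tiny made-up example (not my real data) of what that call is doing; as I understand it, a fractional max.distance is treated as a share of the pattern length, so longer names are allowed proportionally more mis-keyed characters:

# Toy illustration with made-up names.
patterns  <- c("John Smith", "Jane Doe")                            # stands in for unique$name
customers <- c("John Smith", "Jon Smith", "Jane Doe", "Bob Brown")  # stands in for vf$name
matches <- sapply(patterns, agrep, customers, value = TRUE, max.distance = 0.154)
# "John Smith" picks up both "John Smith" and "Jon Smith"; "Jane Doe" matches only itself.
dups_toy <- data.frame(name = unlist(matches))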
This code works well on a small scale: on a sample of 600 or so customers, the agrep runs fine. My problem comes when I attempt to use a unique index of 250K+ names and agrep it against my 1.5 million customers. As I type out this question, the code is still running in R and has not yet stopped (we are going on 20 minutes at this point).
Does anyone have any suggestions to speed this up or improve the code I have used? I have not yet tried anything from the plyr package; perhaps that might be faster, though I am a little unfamiliar with the ddply and llply functions (see my rough attempt below).
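If it matters, my untested understanding is that swapping sapply for plyr's llply would look roughly like this, with .parallel = TRUE requiring a registered backend such as doParallel (the core count is just a placeholder):

library(plyr)
library(doParallel)

# Untested sketch: the same agrep matching, run through llply with a parallel backend.
registerDoParallel(cores = 4)
dups_list <- llply(unique$name,
                   function(p) agrep(p, vf$name, value = TRUE, max.distance = 0.154),
                   .parallel = TRUE)
dups4 <- data.frame(name = unlist(dups_list))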
Any suggestions would be greatly appreciated.