I'm a bit of an R novice and have been experimenting with the agrep function in R. I have a large database of customers (1.5 million rows), many of which I'm sure are duplicates. Many of those duplicates, however, are not revealed by using table() to get the frequency of exactly repeated names. Just eyeballing some of the rows, I have noticed many duplicates that look "unique" only because of a minor mis-key in the spelling of the name.
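For reference, the exact-match check I mean is just something along these lines, using vf$name, my customer-name column (described further below):

# Exact-match check: counts identical spellings only, so a name with a
# one-character typo still shows a frequency of 1 and is missed.
name_counts <- table(vf$name)
exact_dups  <- names(name_counts[name_counts > 1])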
So far, to find the duplicates in my data set, I have been using agrep() to do the fuzzy name matching. I have been playing around with the max.distance argument in agrep() to return different sets of approximate matches, and I think I have found a happy medium between returning false positives and missing true matches. Since agrep() is limited to matching a single pattern at a time, I found an entry on Stack Overflow that helped me write an sapply call to match the data set against numerous patterns. Here is the code I am using to loop over the patterns as it combs through my data set for "duplicates":
dups4 <- data.frame(unlist(sapply(unique$name, agrep, vf$name, value = TRUE, max.distance = 0.154)))
unique$name is the unique index I developed, containing all of the "patterns" I wish to hunt for in my data set.
vf$name is the column in my data frame that contains all of my customer names.
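In case it helps, here is a tiny made-up example (not my real data) of what that call is doing; as I understand it, a fractional max.distance is treated as a share of the pattern length, so longer names are allowed proportionally more mis-keyed characters:

# Toy illustration with made-up names.
patterns  <- c("John Smith", "Jane Doe")                            # stands in for unique$name
customers <- c("John Smith", "Jon Smith", "Jane Doe", "Bob Brown")  # stands in for vf$name
matches <- sapply(patterns, agrep, customers, value = TRUE, max.distance = 0.154)
# "John Smith" picks up both "John Smith" and "Jon Smith"; "Jane Doe" matches only itself.
dups_toy <- data.frame(name = unlist(matches))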
This code works well on a small scale: on a sample of 600 or so customers, the agrep runs fine. My problem comes when I attempt to use a unique index of 250K+ names and agrep it against my 1.5 million customers. As I type out this question, the code is still running in R and has not yet stopped (we are going on 20 minutes at this point).
Does anyone have any suggestions to speed this up or improve the code I have used? I have not yet tried anything from the plyr package; perhaps that might be faster, though I am a little unfamiliar with the ddply and llply functions (see my rough attempt below).
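If it matters, my untested understanding is that swapping sapply for plyr's llply would look roughly like this, with .parallel = TRUE requiring a registered backend such as doParallel (the core count is just a placeholder):

library(plyr)
library(doParallel)

# Untested sketch: the same agrep matching, run through llply with a parallel backend.
registerDoParallel(cores = 4)
dups_list <- llply(unique$name,
                   function(p) agrep(p, vf$name, value = TRUE, max.distance = 0.154),
                   .parallel = TRUE)
dups4 <- data.frame(name = unlist(dups_list))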
Any suggestions would be greatly appreciated.