I have two data files to merge. Both have a key column, fund_name, but the values may differ between the two files, and some rows may have no match at all. I therefore want to do fuzzy matching and return the best match for each row.

I've read a related thread, "agrep: only return best matches", and tried the `amatch(string, stringVector, maxDist = Inf)` function from the stringdist package, which worked well.

I see that `amatch()` offers many different methods (i.e. string distance metrics), such as "osa", "lv", "dl", and so on. Can I combine them and return a value only when all of them find the same match? If so, how should I write the algorithm?

For this fuzzy matching task, I care more about the accuracy of each match than about finding a match for every row. Many thanks for your help!

Hank
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jul 09 '18 at 20:18
  • check out the `fuzzyjoin` package – phiver Jul 10 '18 at 12:12

1 Answer

A possible solution is to use R's built-in `adist` function to compute the generalized Levenshtein distance:

names<-"key, fund_name, keyword"
names_split<-strsplit(names, ", ")[[1]]

names2<-"fund_name2, other_keyword"
names_split2<-strsplit(names2, ", ")[[1]]

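# Build a matrix of generalized Levenshtein distances between the name fields of both sources.
# With partial = TRUE, each name in the first source is compared against substrings of the names in the second.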
dist.name<-adist(names_split, names_split2, partial = TRUE, ignore.case = TRUE)

# For each name in the first source, take the minimum distance to any name in the second
min.name<-apply(dist.name, 1, min)

match.s1.s2<-NULL  
for(i in 1:nrow(dist.name))
{
  s2.i<-match(min.name[i],dist.name[i,])
  s1.i<-i
  match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=names_split2[s2.i], s1name=names_split[s1.i], adist=min.name[i]),match.s1.s2)
}
# and we then can have a look at the results
View(match.s1.s2)
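
The consensus idea from the question (keep a match only when several metrics agree) can be sketched with `stringdist::amatch()` as follows. This is a minimal sketch, not a definitive implementation: the fund names, the `maxDist = 6` cutoff, and the variable names are made up for illustration. Note that "osa", "lv" and "dl" are closely related edit distances, so they will often agree; metrics on a different scale, such as "jw" (which ranges from 0 to 1), would need their own `maxDist`.

```r
library(stringdist)

# Hypothetical sample data, for illustration only
funds_a <- c("Vanguard 500 Index Fund", "Fidelity Contrafund", "Unrelated Name")
funds_b <- c("Vanguard 500 Index", "Fidelity Contra Fund", "PIMCO Total Return")

# Metrics that must all agree (assumed choice; adjust as needed)
methods <- c("osa", "lv", "dl")

# One column of match positions per metric; NA where nothing is within maxDist
idx <- sapply(methods, function(m) amatch(funds_a, funds_b, method = m, maxDist = 6))
idx <- matrix(idx, nrow = length(funds_a))  # keep matrix shape even for a single input

# Keep a match only when no metric returned NA and all point to the same element
agree <- apply(idx, 1, function(r) !anyNA(r) && length(unique(r)) == 1)
consensus <- ifelse(agree, idx[, 1], NA_integer_)

result <- data.frame(fund_a = funds_a, match = funds_b[consensus])
print(result)
```

Rows where the metrics disagree, or where no candidate is close enough, end up with `NA` in `match`, which favors accuracy over coverage.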
wp78de
  • Instead of `base::dist` you can use `stringdist::stringdistmatrix`, which runs multithreaded and gives you access to 9 different distance metrics. –  Aug 06 '18 at 13:05
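
As a rough sketch of what that comment suggests, the distance matrix from the answer could be built with `stringdist::stringdistmatrix()` instead. The example reuses the answer's sample names; lowercasing with `tolower()` is an assumed stand-in for `ignore.case = TRUE`, and the "lv" metric here compares whole strings rather than substrings, so the distances differ from `adist(..., partial = TRUE)`.

```r
library(stringdist)

names_split  <- c("key", "fund_name", "keyword")
names_split2 <- c("fund_name2", "other_keyword")

# Rows: names from the first source; columns: names from the second source
dist.lv <- stringdistmatrix(tolower(names_split), tolower(names_split2), method = "lv")

# Column index of the closest name in the second source, for each row
best <- apply(dist.lv, 1, which.min)

matches <- data.frame(s1name = names_split,
                      s2name = names_split2[best],
                      dist   = dist.lv[cbind(seq_along(best), best)])
print(matches)
```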