R: how to find the most optimal string matches while combining different distance metrics criteria?

Question

I have two data files to merge. Both share the key fund_name, but the names may be spelled differently in the two files, and some rows may have no match at all. I therefore want to do fuzzy matching that returns the best match for each row.

I've read a related thread, "agrep: only return best matches", and tried the amatch(string, stringVector, maxDist = Inf) function from the stringdist package, which worked well.

I see that amatch() supports many different methods (i.e. string distance metrics), such as "osa", "lv", and "dl". Can I combine them and return a value only when all of them find the same match? If so, how should I write the algorithm?

I care more about the accuracy of a match than finding a match in this fuzzy matching work. Many thanks for your help!

When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Jul 09 '18 at 20:18

score 1 · Answer 1 · answered Jul 20 '18 at 20:31

One possible solution is to use R's built-in adist function (from the utils package) to compute the generalized Levenshtein distance:

names <- "key, fund_name, keyword"
names_split <- strsplit(names, ", ")[[1]]

names2 <- "fund_name2, other_keyword"
names_split2 <- strsplit(names2, ", ")[[1]]

# Matrix of generalized Levenshtein distances: rows are names_split, columns are names_split2.
# Note: partial = TRUE scores each name against the best-matching substring of the candidate,
# so "key" is at distance 0 from "other_keyword"; use partial = FALSE to compare whole strings.
dist.name <- adist(names_split, names_split2, partial = TRUE, ignore.case = TRUE)

# For each row, find the column with the minimum distance (the first one on ties)
s2.i <- max.col(-dist.name, ties.method = "first")
s1.i <- seq_len(nrow(dist.name))
min.name <- dist.name[cbind(s1.i, s2.i)]

# Collect the matched pairs in one data frame instead of growing it with rbind in a loop
match.s1.s2 <- data.frame(s2.i = s2.i, s1.i = s1.i,
                          s2name = names_split2[s2.i], s1name = names_split[s1.i],
                          adist = min.name)

# Have a look at the results (View(match.s1.s2) also works in an interactive session)
print(match.s1.s2)
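Since the question puts accuracy ahead of coverage, it may be worth checking what partial = TRUE does in the adist call above. The sketch below uses a subset of the answer's sample names to compare substring matching with whole-string matching; the variable names d_partial and d_full are only illustrative.

```r
# Illustrative subset of the sample names used in the answer
x <- c("key", "fund_name")
y <- c("fund_name2", "other_keyword")

# partial = TRUE: distance from each x to the closest substring of each y
d_partial <- adist(x, y, partial = TRUE, ignore.case = TRUE)

# partial = FALSE (the default): distance between the complete strings
d_full <- adist(x, y, ignore.case = TRUE)

# "key" is contained in "other_keyword", so the partial distance is 0,
# while the whole-string distance counts the 10 extra characters
print(d_partial[1, 2])
print(d_full[1, 2])
```

With partial matching, a short name can look like a perfect match for any longer name that happens to contain it, so whole-string distances are probably the safer choice when false matches are costly.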

Instead of `utils::adist` you can use `stringdist::stringdistmatrix`, which runs multithreaded and gives you access to 9 different distance metrics. — , Aug 06 '18 at 13:05
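To address the original question of combining metrics, one option is to run stringdist's amatch() once per method and keep a match only when every method points at the same candidate. This is a minimal sketch, not a tested recipe: the helper name consensus_match, the sample fund names, and the maxDist = 5 cutoff are all assumptions made for illustration.

```r
library(stringdist)

# Hypothetical sample data, not taken from the question
fund_a <- c("Vanguard 500 Index Fund", "Fidelity Contrafund", "Unrelated Holdings")
fund_b <- c("Vanguard 500 Index Fd", "Fidelity Contra Fund", "T. Rowe Price Growth")

# Hypothetical helper: index into `table` of the match that all methods agree on,
# or NA when any method finds no match or the methods disagree
consensus_match <- function(x, table, methods, maxDist = Inf) {
  idx <- sapply(methods, function(m) {
    amatch(x, table, method = m, maxDist = maxDist)
  })
  idx <- matrix(idx, nrow = length(x))  # keep a matrix even when length(x) == 1
  agree <- apply(idx, 1, function(r) !anyNA(r) && length(unique(r)) == 1)
  ifelse(agree, idx[, 1], NA_integer_)
}

# The three edit-based metrics named in the question share the same scale,
# so one maxDist cutoff (in edit operations) applies to all of them
best <- consensus_match(fund_a, fund_b, methods = c("osa", "lv", "dl"), maxDist = 5)

data.frame(fund_a = fund_a, matched = fund_b[best])
```

A finite maxDist matters here: with maxDist = Inf every row gets some match from every method, so agreement alone does not filter out rows that have no true counterpart. Metrics on a different scale, such as "jw", would need their own cutoff rather than the shared one used in this sketch.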
