26

I'm using the 'agrep' function in R, which returns a vector of matches. I would like a function similar to agrep that only returns the best match, or best matches if there are ties. Currently, I am doing this using the 'sdist()' function from the package 'cba' on each element of the resulting vector, but this seems very redundant.

/edit: here is the function I'm currently using. I'd like to speed it up, as it seems redundant to calculate distance twice.

library(cba)
word <- 'test'
words <- c('Teest','teeeest','New York City','yeast','text','Test')
ClosestMatch <- function(string,StringVector) {
  matches <- agrep(string,StringVector,value=TRUE)
  distance <- sdists(string,matches,method = "ow",weight = c(1, 0, 2))
  matches <- data.frame(matches,as.numeric(distance))
  matches <- subset(matches,distance==min(distance))
  as.character(matches$matches)
}

ClosestMatch(word,words)
Zach
  • 29,791
  • 35
  • 142
  • 201

2 Answers2

30

The agrep package uses Levenshtein Distances to match strings. The package RecordLinkage has a C function to calculate the Levenshtein Distance, which can be used directly to speed up your computation. Here is a reworked ClosestMatch function that is around 10x faster

library(RecordLinkage)

ClosestMatch2 = function(string, stringVector){

  distance = levenshteinSim(string, stringVector);
  stringVector[distance == max(distance)]

}
Richie Cotton
  • 118,240
  • 47
  • 247
  • 360
Ramnath
  • 54,439
  • 16
  • 125
  • 152
  • @DWin. Thanks for the correction. I have edited my answer to correct the spelling. – Ramnath Apr 20 '11 at 02:51
  • Thanks for the answer, that's a great function. What is the intended purpose of that package? There may be other functions in there relevant to my project. – Zach Apr 20 '11 at 13:25
  • 1
    @Zach. Yes. it is likely to contain a lot of functions relevant to your work. There are a lot of vignettes on the CRAN page for this package that you can lookup (http://cran.r-project.org/web/packages/RecordLinkage/index.html) – Ramnath Apr 20 '11 at 14:26
  • You can change from distance == max(distance) to which.max(distance) – Maciej Mar 16 '14 at 08:43
  • Hi I know this is pretty old but I was wondering. Is there a way to extend this function to get a minimal Levenshtein score and `NA` if that minimum is not reached? I have to combine two long vectors of words but it's very likely that in at least 50% of cases, there is no close match... – SJDS May 10 '14 at 12:30
  • The package RecordLinkage has been removed from CRAN because of failures to fix memory problems. – lawyeR Jul 14 '14 at 14:06
  • 1
    Package `RecordLinkage` is available on CRAN, again (Version 0.4-9 as of 2016-05-02. – Uwe Jul 15 '16 at 09:53
  • @SJDS You could wrap the function in an if statement that would only match if a certain max(distance) was reached. I am doing this now. You could add the cutoff as an argument in the function as well, and include a condition in the function that returns some sort of null value if this cutoff is not reached. – Patrick Williams Feb 28 '18 at 20:20
14

RecordLinkage package was removed from CRAN, use stringdist instead:

library(stringdist)

ClosestMatch2 = function(string, stringVector){

  stringVector[amatch(string, stringVector, maxDist=Inf)]

}
Alexander Sigachov
  • 1,541
  • 11
  • 16
  • 3
    Package `RecordLinkage` is available on CRAN, again (Version 0.4-9 as of 2016-05-02. – Uwe Jul 15 '16 at 09:52