I am attempting fuzzy string matching in R using the agrep function. However, I am concerned that it stops as soon as it finds a good enough match rather than searching for the best one, though it is possible that I have misunderstood how it works. The example below reproduces the problem, albeit crudely.

example1 <- c("height", "weight")
example2 <- c("height", "weight")

y <- c("", "")
for (i in 1:2) {
  x <- agrep(example1[i], example2, max.distance = 1, ignore.case = TRUE, value = TRUE, useBytes = TRUE)
  x <- paste0(x, "")
  y[i] <- x
}

As you will hopefully see, agrep has matched weight to height, even though weight is the better match and is also present in example2.

Why is this?

LanieD
  • The value `x` is a vector of two values, but only the first is assigned to `y[i]` (you should get a warning), so only `height` is kept – etienne Nov 23 '16 at 16:09
  • So x contains all matches that meet the criteria, and I am just choosing the first of these, which isn't necessarily the best match? Is there a way to extract just the best one? At the moment I am having to do an exact match followed by a fuzzy match to get around the problem. However this example is not filling me with confidence for the rest of my fuzzy matches. – LanieD Nov 23 '16 at 16:12
  • Yes, that's it. You might look at [this question for the best match](http://stackoverflow.com/questions/5721883/agrep-only-return-best-matches/27090472) – etienne Nov 23 '16 at 16:15
  • Thanks for the assistance. – LanieD Nov 23 '16 at 16:24
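
The point made in the comments can be checked directly. The following is a minimal sketch that reuses the vector from the question and calls agrep once outside the loop; it shows that agrep returns every candidate within the allowed distance, not a single best match.

```r
example2 <- c("height", "weight")

# With max.distance = 1, both candidates are within one edit of "weight",
# so agrep() returns both of them, in the order they appear in example2.
x <- agrep("weight", example2, max.distance = 1,
           ignore.case = TRUE, value = TRUE)
x
length(x)
```

Because `x` has length 2 here, `y[i] <- x` keeps only its first element, which is the first candidate in `example2` rather than the closest one.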

1 Answer

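You can try adist, which computes the generalized Levenshtein (edit) distance between every pair of strings. In the result below, 'height' from example1 best matches 'height' from example2, and 'weight' best matches 'weight', because those pairs have distance 0: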

adist(example1, example2)
#      [,1] [,2]
# [1,]    0    1
# [2,]    1    0

example2[apply(adist(example1, example2), 1, which.min)]
# [1] "height" "weight"
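
If you also want to keep the distance cutoff from the original agrep call, the adist approach could be wrapped as in the sketch below. The helper name `best_match`, its `max_dist` argument, and the extra pattern "age" are hypothetical and only for illustration. Note two assumptions: ties are broken in favour of the first candidate (that is how `which.min` behaves), and the inputs contain no `NA` strings.

```r
# Hypothetical helper: for each pattern, return the closest candidate by
# edit distance, or NA if even the closest one is more than max_dist away.
best_match <- function(patterns, candidates, max_dist = 1) {
  d <- adist(patterns, candidates, ignore.case = TRUE)  # distance matrix
  idx <- apply(d, 1, which.min)                # column of the row minimum
  best <- d[cbind(seq_along(patterns), idx)]   # the minimum distances
  ifelse(best <= max_dist, candidates[idx], NA_character_)
}

res <- best_match(c("height", "weight", "age"), c("height", "weight"))
res
# [1] "height" "weight" NA
```

Unlike agrep, which matches the pattern approximately against substrings of each candidate, adist by default compares whole strings, so the two approaches can disagree when the candidates are much longer than the pattern.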
Sandipan Dey