
The agrep function gives some puzzling results and I'd like to understand its behavior better. For example:

agrep("abcd", c("abc", "abcde", "abcef"), value = TRUE, max.distance = 1)

Returns: [1] "abc" "abcde" "abcef"

But the distance between "abcd" and "abcef" is 2. So I'm not sure why the third match shows up.

levenshteinDist("abcd","abcef") # gives the answer of 2

Also, I assumed that the function would return only exact matches if the distance cap is set to 0:

agrep("abcd", c("abc", "abcde", "abcef"), value = TRUE, max.distance = 0)

However, I get [1] "abcde" as a match.

It would be really helpful if someone could explain how the matching in agrep works.

xyy
  • I suspect that the rather testily written Note section in `?agrep` might apply here. ;) – joran May 15 '15 at 16:21
  • @joran are you referring to this: "Since someone who read the description carelessly even filed a bug report on it, do note that this matches substrings of each element of x (just as grep does) and not whole elements. See also adist in package utils, which optionally returns the offsets of the matched substrings." I read it, but I don't fully understand it. I'm not familiar with how grep works either. – xyy May 15 '15 at 16:25
  • Yes, "this matches substrings of each element of x (just as grep does) and not whole elements". So `"abcd"` needs only to be within 1 of a _substring_ of the comparison strings. It is looking for matches _within_ (that is the word used in the Description section). – joran May 15 '15 at 16:28
  • @joran hm interesting, thanks for the response! So to clarify, the reason that "abcd" is matched to "abcef" in the first example is that if "d" is deleted from "abcd", it would be a match to the substring "abc" in "abcef"? Does this also mean that the transformations are always performed on the pattern argument? – xyy May 15 '15 at 16:40
  • I believe so, yes. I would describe it as "can I transform pattern into a substring of an element of x?" If yes, it matches. The source for agrep is [here](https://github.com/wch/r-source/blob/6d99f42982c486c12f54b937484ad41b8d608bb4/src/main/agrep.c) which would be the definitive answer, provided you know C. – joran May 15 '15 at 16:44
  • @joran thank you very much for clearing this up for me! – xyy May 15 '15 at 17:18
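
To make the substring behavior from the comments concrete, here is a minimal sketch using `adist` from the utils package (the function the `?agrep` Note points to). It assumes that `partial = TRUE` reproduces the substring-based distance agrep uses by default, as the `adist` documentation describes; the variable names are only for illustration.

```r
x <- c("abc", "abcde", "abcef")
pattern <- "abcd"

# Whole-string generalized Levenshtein distance: the pattern must be
# transformed into the complete element.
full <- drop(adist(pattern, x))

# Substring distance: the pattern only has to be transformed into some
# substring of the element (assumed to match agrep's default notion).
part <- drop(adist(pattern, x, partial = TRUE))

print(data.frame(x, full, part))
# full is 1, 1, 2 and part is 1, 0, 1

# To match whole elements rather than substrings, filter on the full
# distance instead of calling agrep.
whole_matches <- x[full <= 1]
print(whole_matches)
# "abc" "abcde"
```

The `part` column lines up with both agrep results in the question: every element is within 1 of the pattern, and only "abcde" is within 0 because it contains "abcd" exactly. Filtering on `full` is one way to get the whole-element behavior the question expected.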
