
EDIT: This bug was found in 32-bit versions of R and was fixed in R version 2.9.2.


This was tweeted to me by @leoniedu today, and I don't have an answer for him, so I thought I would post it here.

I have read the documentation for agrep() (fuzzy string matching) and it appears that I don't fully understand the max.distance parameter. Here's an example:

pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"
agrep(pattern,x,max.distance=18) 
agrep(pattern,x,max.distance=19)

That behaves exactly as I would expect. The strings differ by 18 characters, so I would expect that to be the threshold for a match. Here's what's confusing me:

agrep(pattern,x,max.distance=30) 
agrep(pattern,x,max.distance=31)
agrep(pattern,x,max.distance=32) 
agrep(pattern,x,max.distance=33)

Why are 30 and 33 matches, but not 31 and 32? To save you some counting,

> nchar("Staatssekretar im Bundeskanzleramt")
[1] 34
> nchar("Bundeskanzleramt")
[1] 16
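
The count of 18 can be cross-checked with a short sketch. It assumes an R version newer than the 2.9.x discussed here, one where base R (package utils) provides adist() for generalized Levenshtein distance; it is not part of the original question.

```r
# Sketch: confirm the length gap and the edit distance between the two strings.
# Assumes adist() from utils is available (newer R than 2.9.x).
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"

len_gap <- nchar(pattern) - nchar(x)         # 34 - 16
edit_dist <- as.integer(adist(pattern, x))   # full-string Levenshtein distance

len_gap
edit_dist
```

Both values should come out as 18, because x is the pattern with its first 18 characters ("Staatssekretar im ") removed.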
oguz ismail
JD Long
  • http://www.nabble.com/possible-agrep-bug--R-2.9.1,-Mac-OS-X-10.5-(PR-13789)-td24285192.html – ars Jul 25 '09 at 21:49
  • Follow-up: it was a bug in 32-bit R, which was fixed in R 2.9.2 (as detailed in Brian Ripley's message of August 14th on the R list, in the link above). – Eduardo Leoni Sep 03 '09 at 01:06
  • If you could post that comment as an answer, I'll happily accept it and close this question out with an answer. Thanks for pointing out the bug fix. – JD Long Sep 03 '09 at 17:01

2 Answers


I posted this on the R list a while back and reported it as a bug on the R-bugs list. I got no useful responses, so I tweeted to see whether the bug was reproducible or I was just missing something. JD Long was able to reproduce it and kindly posted the question here.

Note that, at least in R, agrep is a misnomer: it does not match regular expressions, whereas grep stands for "Globally search for the Regular Expression and Print". It shouldn't have a problem with patterns longer than the target vector (I think!).

On my Linux server all is well, but not on my Mac and Windows machines.

Mac:

> sessionInfo()
R version 2.9.1 (2009-06-26)
i386-apple-darwin8.11.1

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

> agrep(pattern,x,max.distance=30)
[1] 1
> agrep(pattern,x,max.distance=31)
integer(0)
> agrep(pattern,x,max.distance=32)
integer(0)
> agrep(pattern,x,max.distance=33)
[1] 1

Linux:

R version 2.9.1 (2009-06-26)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

> agrep(pattern,x,max.distance=30)
[1] 1
> agrep(pattern,x,max.distance=31)
[1] 1
> agrep(pattern,x,max.distance=32)
[1] 1
> agrep(pattern,x,max.distance=33)
[1] 1
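
One way to check a given machine for this behaviour is to scan a range of max.distance values and test that a match, once it appears, never disappears again. The following is a minimal sketch under that assumption; is_monotone() is a hypothetical helper introduced here for illustration, not part of R or of the original bug report.

```r
# Sketch: a match at some max.distance should persist at every larger value.
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"
distances <- 18:33

# TRUE where agrep() reports a match at that max.distance
matched <- vapply(
  distances,
  function(d) length(agrep(pattern, x, max.distance = d)) > 0,
  logical(1)
)

# Hypothetical helper: FALSE if a match ever disappears as the distance grows
is_monotone <- function(m) all(diff(as.integer(m)) >= 0)

data.frame(max.distance = distances, matched = matched)
is_monotone(matched)
```

On the affected Mac build, the results for 30 to 33 (match, no match, no match, match) would make is_monotone() return FALSE.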

Eduardo Leoni
  • I think it is safe to say this is a bug. Even if the use case is atypical, I can't see a good reason why this should produce different results on different platforms. Were you able to reproduce this with different strings of different lengths? – JD Long Jul 26 '09 at 00:45

I am not sure your example makes sense. For the basic grep(), pattern is often a simple string or a regular expression, and x is a vector whose elements get matched against pattern. Having pattern be a longer string than x strikes me as odd.

Consider this where we just use grep instead of substr:

R> grep("vo", c("foo","bar","baz"))   # vo is not in the vector
integer(0)
R> agrep("vo", c("foo","bar","baz"), value=TRUE) # but is close enough to foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.25) # still foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.75) # now all match
[1] "foo" "bar" "baz"
R>  
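
The fractional max.dist values above can be read as integer edit budgets. As I read the agrep() documentation, a fraction is multiplied by the pattern length and rounded up to the next integer. The sketch below assumes that rule and unit edit costs; it only illustrates the arithmetic and is not agrep's actual code.

```r
# Sketch: convert fractional max.distance values into whole-edit budgets,
# assuming budget = ceiling(fraction * pattern length) with unit costs.
pattern <- "vo"
fractions <- c(0.1, 0.25, 0.75)   # 0.1 is agrep()'s default max.distance

budget <- ceiling(fractions * nchar(pattern))
data.frame(fraction = fractions, budget = budget)
```

Under this reading, the default and 0.25 both allow a single edit (enough to turn "vo" into "fo", which occurs in "foo"), while 0.75 allows two edits, the full length of the pattern, which would explain why every element matches.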
Dirk Eddelbuettel
  • From the docs of agrep, it's a Levenshtein distance, so the length of the pattern doesn't matter -- we're interested in some transformation of the pattern that results in a substring of the string in question. (It's more interesting when the max distance is held low.) But it's obviously a bug, since the edit distance doesn't just become discontinuous between 30 and 33. – ars Jul 26 '09 at 09:44