I'm trying to match a string to a vector of strings:

a <- c('abcde', 'abcdf', 'abcdg')

agrep('abcdh', a, max.distance=list(substitutions=1))
# [1] 1 2 3

agrep('abchh', a, max.distance=list(substitutions=2))
# character(0)

I didn't expect the latter result, since substituting two characters in the pattern makes it identical to each of the vector elements. It does work, however, with `all` instead of `substitutions`:

agrep('abchh', a, max.distance=list(all=2))
# [1] 1 2 3

What do I need to change to match with more than one substitution allowed? Is `substitutions` just a broken option? Thanks.

Note: this question is essentially the same as this one: https://stat.ethz.ch/pipermail/r-help/2011-June/281731.html, but that was never answered.

esa606
  • If you want to allow only substitutions, you could use `all=2, insertions=0, deletions=0, substitutions=2`. I can't explain the behaviour, other than to add that, for your example, it disappears when the string length is greater than 10, so it might be linked to `If cost is not given, all defaults to 10%` (from `?agrep`) – ping Jul 22 '14 at 14:45
  • Hmm, for me it did not disappear even when I used very long strings. It does seem a little buggy, like it's overriding itself. But good idea for a workaround, thanks! – esa606 Jul 22 '14 at 15:45
  • For the sake of using the same comparison, I was comparing `agrep("abchh", "abcdd", max.distance=list(substitutions=2))` to `agrep("aaaaaaabchh", "aaaaaaabcdd", max.distance=list(substitutions=2)) ` – ping Jul 22 '14 at 15:48
  • I tried this and noted that integer(0) came back, but also the difference between .2 and .21: `agrep('abchh', a, max.distance=0.21)` gives `[1] 1 2 3`, while `agrep('abchh', a, max.distance=0.2)` gives `integer(0)` – lawyeR Jul 22 '14 at 15:56

1 Answer

I did not realize that these questions were that old, but anyway:

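The function needs an appropriate `cost`. As ping said, when `cost` is not given, `all` defaults to 10% of the pattern length (rounded up), which is 1 for a five-character pattern, so `substitutions=2` on its own is still capped at one edit in total. You must set the maximum match cost explicitly; in your example: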

a <- c('abcde', 'abcdf', 'abcdg')

agrep('abcdh', a, max.distance = list(cost = 1))
# [1] 1 2 3

agrep('abchh', a, max.distance = 2)
# [1] 1 2 3
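
The 10% default that ping quotes from `?agrep` can be checked by hand. The sketch below assumes that a fractional bound is turned into an integer by multiplying it by the pattern length and rounding up, with the default transformation costs of 1; `effective_bound` is a hypothetical helper written for illustration, not part of R.

```r
# Hypothetical helper (not part of base R): estimate the integer bound that
# agrep() derives from a fractional max.distance. Assumes the bound is
# ceiling(fraction * pattern length), with default transformation costs of 1.
effective_bound <- function(pattern, fraction = 0.1) {
  ceiling(fraction * nchar(pattern))
}

effective_bound("abchh")         # 5-character pattern, 10% default
effective_bound("aaaaaaabchh")   # 11-character pattern, 10% default
effective_bound("abchh", 0.2)    # fraction from lawyeR's comment
effective_bound("abchh", 0.21)   # fraction from lawyeR's comment
```

Under that assumption the five-character pattern is capped at 1 edit by default and the eleven-character one at 2, which would be consistent with ping's observation; likewise 0.2 and 0.21 give bounds of 1 and 2, consistent with lawyeR's.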

Now, if you set only `cost`, the match may use insertions, deletions and substitutions. If you want to allow only substitutions, then:

agrep('abhhh', a,
      max.distance = list(cost = 3, substitutions = 3,
                          deletions = 0, insertions = 0))
# [1] 1 2 3
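
The substitution-only call can be wrapped up for reuse. This is a minimal sketch; `agrep_subst_only` is a hypothetical name, and it simply passes the same kind of `max.distance` list as above, with `n` as both the cost and the substitution limit.

```r
# Hypothetical wrapper (not part of base R): approximate matching that allows
# up to n substitutions and no insertions or deletions.
agrep_subst_only <- function(pattern, x, n, ...) {
  agrep(pattern, x,
        max.distance = list(cost = n, substitutions = n,
                            insertions = 0, deletions = 0),
        ...)  # extra arguments such as value = TRUE are passed on to agrep()
}

a <- c('abcde', 'abcdf', 'abcdg')
agrep_subst_only('abhhh', a, 3)
# [1] 1 2 3
```

Passing `value = TRUE` through `...` should return the matching strings instead of their indices.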