
I expect the regex pattern ab{,2}c to match only an a followed by 0, 1 or 2 bs, followed by a c.

It works that way in lots of languages, for instance Python. However, in R:

grepl("ab{,2}c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
# [1]  TRUE  TRUE  TRUE  TRUE FALSE

I'm surprised by the 4th TRUE. In ?regex, I can read:

{n,m} The preceding item is matched at least n times, but not more than m times.

So I agree that {,2} should be written {0,2} to be a valid pattern (unlike in Python, where the docs state explicitly that omitting n specifies a lower bound of zero).

But then using {,2} should throw an error instead of returning misleading matches! Am I missing something or should this be reported as a bug?

Scarabee
    You used the default TRE regex engine. If you use the PCRE one, you would get false for all items. Always specify the lower bound to get consistent behavior across engines. – Wiktor Stribiżew Oct 29 '17 at 12:16
    That's nuts! Looks like a bug to me. – janos Oct 29 '17 at 12:23
    @Wiktor: my concern is about consistency in the TRE engine alone: how can it match 3 `b`s and not match 4 `b`s when I'm asking it to match at most 2 `b`s? – Scarabee Oct 29 '17 at 12:42

3 Answers

9

The behavior with {,2} is not expected; it is a bug. If you look at the tre_parse_bound function in the TRE source code, you will see that the min variable is set to -1 before the engine tries to initialize the minimum bound. When the minimum is missing from the quantifier, the number of "repeats" seems to become the maximum value + 1 (as if the repeat count were max - min = max - (-1) = max + 1).

So, a{,} matches exactly one occurrence of a, and so do a{, } and a{ , }. See the R demo below, where only abc is matched by ab{,}c:

grepl("ab{,}c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
grepl("ab{, }c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
grepl("ab{ ,   }c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
## => [1] FALSE  TRUE FALSE FALSE FALSE
Wiktor Stribiżew
  • Thanks, I'll create an issue on the TRE GitHub. Do you think I should report the bug to the R team as well? – Scarabee Oct 29 '17 at 20:24
  • @Scarabee I am not sure TRE is under active development. The R docs also suggest using the PCRE regex engine if you need more stability with regex (see [*The POSIX 1003.2 mode of `gsub` and `gregexpr` does not work correctly with repeated word-boundaries*](https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html)), and I am not sure the R team is actually interested in solving this. I am more inclined to think they will keep it as "by design", as there are other TRE limitations, e.g. [you cannot set the `max` bound to more than `255`](https://ideone.com/o4nY7c). – Wiktor Stribiżew Oct 29 '17 at 20:33
2

Just as an addition:

vec1 <- c('', 'a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa')

grep("^a{,1}$", vec1, value = TRUE) # seems to "become" ^a{1}$
grep("^a{,2}$", vec1, value = TRUE) # seems to "become" ^a{0,3}$
grep("^a{,3}$", vec1, value = TRUE) # seems to "become" ^a{0,4}$
grep("^a{,4}$", vec1, value = TRUE) # seems to "become" ^a{0,5}$
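
The same probing can be wrapped in a small helper. This is only a sketch: matched_counts and its arguments are hypothetical names introduced here, not part of base R. It reports which repetition counts of a a quantifier accepts, so the explicit form can be compared with the form that omits the lower bound.

```r
# Hypothetical helper (not from base R): return the repetition counts of "a"
# that are accepted by the anchored pattern ^a<quantifier>$.
matched_counts <- function(quantifier, max_len = 8, perl = FALSE) {
  candidates <- strrep("a", 0:max_len)          # "", "a", "aa", ...
  pattern <- paste0("^a", quantifier, "$")
  (0:max_len)[grepl(pattern, candidates, perl = perl)]
}

matched_counts("{0,2}")   # explicit lower bound: 0 1 2
matched_counts("{,2}")    # missing lower bound: depends on the engine
```

Only the result for the explicit {0,2} is given as a comment, because the outcome for {,2} depends on the regex engine and its version.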
Andre Elrico
0

I am writing this as an answer because, unfortunately, I can't add a comment.

Update: Following the answer by Wiktor Stribiżew and the feedback, it seems the behavior is categorized as a bug.

Original: The syntax you are using is just not supported in R (assuming the default engine). This is why you are getting unexpected results.

  • The supported syntax is {n,m} as the documentation states. Thus, you need to specify both boundaries, e.g. {0,2}, which will return the correct result.
  • The syntax {,m}, on the other hand, is missing from the ?regex documentation, which implicitly indicates that it is not supported.
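
As a minimal sketch of the first point, the explicit lower bound gives the expected result on the vector from the question. The wrapper strict_grepl is a hypothetical helper, not an R function; it assumes that any { followed by optional spaces and a comma is a quantifier with a missing lower bound, and it refuses such a pattern with an error instead of passing it to the engine.

```r
x <- c("ac", "abc", "abbc", "abbbc", "abbbbc")

# Explicit lower bound: at most two "b"s are accepted.
grepl("ab{0,2}c", x)
# [1]  TRUE  TRUE  TRUE FALSE FALSE

# Hypothetical guard (an assumption, not part of base R): fail early when a
# quantifier omits its lower bound. The check is deliberately naive and also
# rejects an escaped literal brace followed by a comma.
strict_grepl <- function(pattern, x, ...) {
  if (grepl("\\{\\s*,", pattern, perl = TRUE)) {
    stop("quantifier without a lower bound; write {0,m} instead of {,m}")
  }
  grepl(pattern, x, ...)
}

strict_grepl("ab{0,2}c", x)   # same result as the plain grepl() call above
```

With this guard, strict_grepl("ab{,2}c", x) stops with an error rather than returning matches.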

In case you would like to explore differences in syntax, I would recommend taking a look at the regular-expressions.info comparison page. (You need to compare Python and R in terms of Quantifiers in this case.)

Plamen Petrov
    Thanks for your help, but as I said in my comment above, my primary concern is not about the lack of consistency across engines. I agree that using `{0,2}` yields the right results, but my point is that `{,2}` should either throw an error ("invalid regex") or return consistent matches. – Scarabee Oct 29 '17 at 15:03
    If it’s not supported, it *must* raise an error, not silently output wrong results. – Konrad Rudolph Nov 11 '17 at 15:08