
Consider the following commands:

text <- "abcdEEEEfg"

sub("c.+?E", "###", text)
# [1] "ab###EEEfg"                          <<< OKAY
sub("c(.+?)E", "###", text)
# [1] "ab###EEfg"                           <<< WEIRD
sub("c(.+?)E", "###", text, perl=TRUE)
# [1] "ab###EEEfg"                          <<< OKAY  

The first does exactly what I expect: it matches only up to the first E. The second should be essentially identical to the first, since all I'm doing is adding a capturing group (which I'm not even using), yet for some reason it consumes an extra E. That said, it isn't fully greedy (if it were, it would have consumed all the Es). Even weirder, the pattern still matches, even though the sub result suggests the .+? piece left out EE, which can no longer be matched by the rest of the regular expression. This suggests an offset problem when computing the length of the matched sub-expression rather than a problem in the matching itself.

The final one is exactly the same but run with PCRE, and that works as expected.
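One way to probe the offset theory is to ask each engine where it thinks the match starts and how long it is. This is only a sketch: `pattern`, `m_tre` and `m_pcre` are names made up for illustration, and no output is shown for the default-engine lines because the result may depend on the TRE version bundled with your R.

```r
text <- "abcdEEEEfg"
pattern <- "c(.+?)E"

# Default engine (TRE): position and length of the first match.
m_tre <- regexpr(pattern, text)

# Same pattern through PCRE, for comparison.
m_pcre <- regexpr(pattern, text, perl = TRUE)

# Start offset and match length as reported by each engine.
c(start = as.integer(m_tre),  length = attr(m_tre,  "match.length"))
c(start = as.integer(m_pcre), length = attr(m_pcre, "match.length"))

# The substrings those offsets correspond to.
regmatches(text, m_tre)
regmatches(text, m_pcre)
```

If the two reported lengths differ for the same pattern and input, that would point at the length computation rather than at whether a match is found.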

Am I missing something or is this behavior undocumented/buggy?

BrodieG

1 Answer


R uses libtre, version 0.8. For more stability, you should always use perl = TRUE.

Note that

sub("c(.+?)E?", "###", text)

works.
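If what you need is the text of the capture group rather than the substitution, here is a minimal sketch that stays on the PCRE side by combining `regexpr` with `perl = TRUE`. It assumes a single input string and a single group; `whole` and `group1` are illustrative names only.

```r
text <- "abcdEEEEfg"

# With perl = TRUE, regexpr() also records where each capture group matched.
m <- regexpr("c(.+?)E", text, perl = TRUE)

# The whole match.
whole <- regmatches(text, m)

# capture.start / capture.length are matrices:
# one row per input string, one column per capture group.
st  <- attr(m, "capture.start")
len <- attr(m, "capture.length")

# Extract group 1 of the first (and only) input string.
group1 <- substring(text, st[1, 1], st[1, 1] + len[1, 1] - 1)

whole   # "cdE"
group1  # "d"
```

For several input strings or several groups, the same attributes can be indexed by row and column.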

Christopher Louden
  • This is what I've always done, but there are some things not implemented with the `perl = T` flag (`regexec` in particular). My actual bug came up while trying to use `regexec` (or more specifically, the `str_match_all`/etc. tools in `stringr` that rely on it), and I was similarly able to work around it by adding `.*` after the pattern, though for the `sub` example that obviously doesn't work. If no one else has more info by the morning I'll take this as the answer. Do you know if there are any plans to update the library? Looks like 0.8 has been around for 4 years. – BrodieG Feb 27 '14 at 02:30
  • Actually, looks like the **[TRE library has already been updated](http://cran.r-project.org/bin/windows/base/old/2.15.2/NEWS.R-2.15.2.html)** (search for TRE). – BrodieG Feb 27 '14 at 02:35
  • I fixed my answer to reflect the update. It doesn't look like development is continuing on `libtre`. There are several open issues, [one](https://github.com/laurikari/tre/issues/11) of which is about R. I think this should be raised as a bug to the R development team. – Christopher Louden Feb 27 '14 at 03:33
  • I submitted this to R and got sent packing suggesting I submit it to TRE instead. I submitted it to laurikari as well, though I suspect you're right that it is the same issue you link. – BrodieG Jun 19 '14 at 13:25