I was refining this answer and found that the regex below does not behave in R the way its meaning suggests it should.

 +?on.*$

According to my understanding of regex, the above regex matches:

one or more spaces (lazily), followed by on, followed by anything (except a newline) up to the end of the line.

INPUT:

Posted by ondrej on 29 Feb 2020.
Posted by ona'je on 29 Feb 2020.

EXPECTED OUTPUT (if every match of the above pattern in the test string is replaced by ""):

Posted by
Posted by 

When I test it in Python (implementation here), JavaScript and Java (implementation here), I get the result I expect.

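// Test string: one complete line and one truncated line
const myString = "Posted by ondrej on 29 Feb 2020.\nPosted by ona'je on";

// g: replace every match; m: let $ match at the end of each line
console.log(myString.replace(new RegExp(" +?on.*$", "gm"), ""));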

On the other hand, when I use the same regex in R (implementation here), I get

Posted by ondrej
Posted by ona'je

and this is unexpected.

Doubt

I thought that maybe the regex engine in R works differently (perhaps from right to left). I read the documentation on how regex works in R but, for the above regex, found nothing that differs from other languages. I may be missing something here. I am not well versed in R, but as far as my regex knowledge goes, I believe the above regex should work in every standard regex engine the way it works in Java, JavaScript and Python (and maybe PCRE too). My question is: why does the above regex work differently in R?

  • I think you would need to set `perl = TRUE` in R to get the same result passed as an option to `regex`. See `?regex`. – edsandorf May 26 '20 at 09:48
  • Thanks @edsandorf. I can write another regex to get the answer, but I'm interested in knowing why `R` behaves this way. Is my understanding of the above regex wrong, or does the engine work from right to left or something? –  May 26 '20 at 09:51

1 Answer

It looks like the TRE regex engine (used by default in base R regex functions), which is based on the regex library originally written by Henry Spencer in 1986, returns the shortest match at the end of the string when the regular expression starts with a lazily quantified pattern and ends with the $ anchor.

Compare these cases:

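# Assumed input, taken from the question (the answer does not show how Data was built)
Data <- c("Posted by ondrej on 29 Feb 2020.", "Posted by ona'je on 29 Feb 2020.")

sub(" +?on.*$", "", Data)  # "Posted by ondrej" "Posted by ona'je"
sub(" +?on.*", "", Data)   # "Posted bydrej on 29 Feb 2020." "Posted bya'je on 29 Feb 2020."
sub(" +?on(.*)", "", Data) # as expected
sub(" +on.*", "", Data)    # as expected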

What is going on?

  • In the first case, sub(" +?on.*$", "", Data), the first quantified pattern sets the greediness of all quantifiers on the same level of the regex. Because the leading space is quantified with the lazy +?, the second quantifier, *, is treated as lazy too, even without a ? after it. This is a known TRE "bug", also present in some other regex engines based on Henry Spencer's regex library.

  • The second, sub(" +?on.*", "", Data), matches as if it were written " +?on.*?" (again, because the first pattern sets that level to lazy). It therefore matches only one or more spaces followed by on, since .*? matches nothing when it sits at the end of the pattern.

  • The third one, sub(" +?on(.*)", "", Data), yields the expected results because the second quantified pattern, .*, is on a different level (one level deeper, inside the group), so its greediness is not affected by the +? on the outer level. So, (.*) matches greedily here.

  • The fourth one, sub(" +on.*", "", Data), yields the expected results because the first pattern is greedy, so the next quantified pattern is greedy as well.
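
For comparison, here is a minimal sketch of the same four patterns in JavaScript, a backtracking engine in which each quantifier keeps its own greediness. The input lines are the ones from the question, and the names data, patterns, results and explicitLazy are made up for this illustration. The last call writes the second quantifier as explicitly lazy, which is how the second TRE case is described above; it only imitates that one case and says nothing about TRE's internals.

```javascript
// Input lines taken from the question.
const data = [
  "Posted by ondrej on 29 Feb 2020.",
  "Posted by ona'je on 29 Feb 2020.",
];

// The four patterns compared above, written as JavaScript regex literals.
const patterns = [/ +?on.*$/, / +?on.*/, / +?on(.*)/, / +on.*/];

// String.prototype.replace with a non-global regex replaces only the
// first match, which mirrors what sub() does in R.
const results = patterns.map((re) => data.map((s) => s.replace(re, "")));

for (const [i, re] of patterns.entries()) {
  console.log(String(re), JSON.stringify(results[i]));
}

// Making the second quantifier explicitly lazy: it matches nothing at
// the end of the pattern, so only " on" is removed from each line.
const explicitLazy = data.map((s) => s.replace(/ +?on.*?/, ""));
console.log(JSON.stringify(explicitLazy));
```

In this engine all four patterns leave "Posted by" for both lines, while the explicitly lazy variant removes only the first " on", which matches the output reported for the second sub() call.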

Wiktor Stribiżew
  • Thanks for the answer! Why does `sub(" +on.*?$", "", Data)` return only `"Posted by"`? – GKi May 26 '20 at 11:44
  • @GKi The first pattern is greedy and sets the greediness to greedy for all quantifiers on the same level. This already rules out the first scenario: the regex has to start with a lazily quantified pattern for it to match in the right-associative manner. – Wiktor Stribiżew May 26 '20 at 11:46
  • Thanks for the enlightenment @WiktorStribiżew. Just one more doubt: in **TRE compliant engines**, whenever a lazy match comes first, does it turn the other quantifiers lazy **only on the same level**? –  May 26 '20 at 12:41
  • @Mandy8055 Yes, only on the same level. There are not so many engines like this, PostgreSQL, Tcl and TRE are the only ones I know. – Wiktor Stribiżew May 26 '20 at 12:45
  • @WiktorStribiżew Can you please quote the place where it is mentioned that R uses `TRE`? I didn't find this while reading the documentation. I found only the [**POSIX 1003.2 standard**](https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html) mentioned as the default implementation. –  May 26 '20 at 12:50
  • @Mandy8055 See [Regular Expressions as used in R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html): *This help page is based on the TRE documentation and the POSIX standard, and the pcrepattern man page from PCRE 8.36.* There is a link to the TRE docs, too. – Wiktor Stribiżew May 26 '20 at 12:52
  • @WiktorStribiżew According to your first point, using different levels should help, but that also seems to yield an undesirable result. Please have a look at [**this**](https://repl.it/repls/DapperPleasantBases). It does not work as expected for this regex. –  May 26 '20 at 13:10
  • @Mandy8055 This `" +?on(.*)$"` pattern is from scenario one: the starting pattern is lazy and then there is `$` at the end. I described the quantifier greediness levels in Scenario 1, just because it is the first, but that is not making much difference in Scenario 1. When there is no `$` at the end, the behavior is as in scenarios 2 and 3. – Wiktor Stribiżew May 26 '20 at 13:12
  • Thanks @WiktorStribiżew for the answer. It helped me learn many new things, but the question is still the same. The pattern ` +?on(.*)$` produces **Posted by ondrej** in R but not in other regex engines (although I changed the level of the quantifiers). They produce **Posted by**. Thanks anyway. +1 –  May 26 '20 at 13:18
  • @Mandy8055 If you need to make matching consistent, use a greedy quantifier with spaces, `" +on.*"`. Note that `.` in TRE regex will match across line breaks, and won't do that in JS/PCRE/.NET/Python/Java. You should never expect one and the same pattern to work the same in different regex libraries. They may match differently even with the same pattern, as with the current expression. What else do you want to know to make it not "the same"? What answer do you expect? – Wiktor Stribiżew May 26 '20 at 13:20
  • @Mandy8055 What answer do you expect? – Wiktor Stribiżew May 26 '20 at 13:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/214673/discussion-between-mandy8055-and-wiktor-stribizew). –  May 26 '20 at 13:27
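
The remark in the comments about `.` and line breaks can be checked on the JavaScript side with a small sketch. The names text, withoutDotAll and withDotAll are made up for this illustration. The s (dotAll) flag is used only as a stand-in for the cross-line behaviour the comment attributes to TRE, so this is not a demonstration of TRE itself.

```javascript
// Both lines from the question, joined into a single string.
const text =
  "Posted by ondrej on 29 Feb 2020.\nPosted by ona'je on 29 Feb 2020.";

// Without the s flag, "." stops at the line break, so only the rest of
// the first line is removed.
const withoutDotAll = text.replace(/ +on.*/, "");

// With the s flag, "." also matches "\n", so the match runs to the end
// of the whole string.
const withDotAll = text.replace(/ +on.*/s, "");

console.log(JSON.stringify(withoutDotAll));
console.log(JSON.stringify(withDotAll));
```

This is one reason why the same pattern can remove different amounts of text in different engines when the input spans several lines.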