16

Both are supposed to the best of my knowledge to be the same but I actually see a difference, look at this minimal example from this question:

a<-c("/Cajon_Criolla_20141024","/Linon_20141115_20141130",
"/Cat/LIQUID",
"/c_puertas_20141206_20141107",
"/C_Puertas_3_20141017_20141018",
"/c_puertas_navidad_20141204_20141205")

sub("(.*?)_([0-9]{8})(.*)$","\\2",a)
[1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107" "20141018"   
[6] "20141205"   

sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a)
[1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206" "20141017"   
[6] "20141204" 

What am I missing? Or is this a bug?

Community
  • 1
  • 1
cmbarbu
  • 4,354
  • 25
  • 45
  • In a semantic meaning, the two patterns are exactly the same, the only difference I know is that any regex engines are more performant with one of these two syntaxes for intrinsic reasons, but nothing of this explain the different results you obtain here. – Casimir et Hippolyte Feb 25 '15 at 22:02
  • 2
    Interestingly, if you change the parentheses to `"(.*)?_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]).*$"` it works – Rich Scriven Feb 25 '15 at 22:32
  • 1
    @RichardScriven, but that changes the meaning of the regular expression, see: `sub("(.*)?", "", "___x", perl=T)` VS `sub("(.*?)", "", "___x", perl=T)`. In this case, it doesn't matter though. – BrodieG Feb 25 '15 at 22:41
  • @BrodieG It does matter as I need the second result, not the second. – cmbarbu Feb 25 '15 at 22:50
  • @cmbarbu, agreed, I wasn't looking closely. – BrodieG Feb 25 '15 at 23:08

3 Answers3

12

This is a bug in the TRE library related to greedy modifiers and capture groups. See:

Community
  • 1
  • 1
BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • The first quantifier used in the pattern is non-greedy and the greedy/ungreedy switcher modifier isn't used, not sure that it is the same bug. – Casimir et Hippolyte Feb 25 '15 at 22:45
  • @CasimiretHippolyte, certainly possible it is not the same bug, but there is a greedy modifier in a capture group `(.*?)` and as per Richard's experiment, changing that to `(.*)?` (i.e. taking the greedy modifier out of the capture group) fixes it, so it seems likely it is the same problem, or at a minimum very closely related. – BrodieG Feb 25 '15 at 22:48
  • `.*?` is a non-greedy quantifier. `.*` is a greedy quantifier. `(?U)` is the modifier that inverses these meanings. – Casimir et Hippolyte Feb 25 '15 at 22:50
  • @CasimiretHippolyte, when I say greedy modifier, I mean "modifier of greediness". I will strive to be more precise in the future ;) – BrodieG Feb 25 '15 at 22:50
  • `(.*)?` has a totally different behavior/meaning and is not pertinent here. – Casimir et Hippolyte Feb 25 '15 at 22:53
  • I realize `(.*)?` has a completely different meaning (see my comment to Richard). But, to confirm that modifying greediness is the issue, consider: `sub("([^2]*)_([0-9]{8})(.*)$","\\2",a)` which works properly. Again, not proof positive, but I find it hard to believe the issues are not related. – BrodieG Feb 25 '15 at 23:07
  • @BrodieG does my new answer make sense to you? – cmbarbu Feb 25 '15 at 23:33
4

Setting perl=TRUE gives the same answer (as expected) for both expressions:

> sub("(.*?)_([0-9]{8})(.*)$","\\2",a,perl=TRUE)
[1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"    "20141204"   
> sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a,perl=TRUE)
[1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"    "20141204"
nhahtdh
  • 55,989
  • 15
  • 126
  • 162
Metrics
  • 15,172
  • 7
  • 54
  • 83
2

Though I was initially convinced by BrodieG answer, it seems that [0-9] n times and [0-9]{n} are indeed different, at least for the "tre" regexp motor. According to http://www.regular-expressions.info the {} operator is greedy, [0-9] is not.

Hence the right regular expression in my case should be:

    sub("(.*?)_([0-9]{8}?)(.*)$","\\2",a)

Making all the difference:

    sub("(.*?)_([0-9]{8})(.*)$","\\2",a)
    [1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107"    "20141018"   
    [6] "20141205"   
    sub("(.*?)_([0-9]{8}?)(.*)$","\\2",a)
    [1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"   
    [6] "20141204"

And even

    > sub("(.*)_([0-9]{8}?)(.*)$","\\2",a)
    [1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"   
    [6] "20141204"

Interpretation: 1) tre considers ? as "evaluate next atom the first time you can match this atom". This is always true for ".?" as everything matches, and it switches to _[0-9]{8}. When reaching the first group of 6 numbers, if {} is not greedy (no ?), as (.) matches also the first 8 numbers, the search continues to see if an other occurrence of "_[0-9]{8}" can be found on the line. If meeting the second set of 8 figures, it also memorizes it as a matching pattern, then it reaches the end of the line, the last matching pattern is kept and [0-9]{8} is matched to the second set of 8 numbers.

2) When {} operator is modified by ? The search stops the first time it sees 8 numbers, check if _(.*) can be matched to the rest. It can, so it returns the first set of 8 numbers.

Note that the perl regexp motor works differently,

1) ? after {} doesn't change a thing:

     sub("(.*)_([0-9]{8})","\\2",a,perl=TRUE)
     [1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107"    "20141018"   
     [6] "20141205"   
     sub("(.*)_([0-9]{8}?)","\\2",a,perl=TRUE)
     [1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107"    "20141018"   
     [6] "20141205"

2) the ? applied to .* makes it to stop at the first set of 8 figures:

     sub("(.*?)_([0-9]{8}).*","\\2",a,perl=TRUE)
     [1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"   
     [6] "20141204"   
     sub("(.*)_([0-9]{8}).*","\\2",a,perl=TRUE)
     [1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107"    "20141018"   
     [6] "20141205" 

From these two observations, it seems that the two engines interpret differently the greediness in two different instances. I always found the greediness concept to be a bit vague ...

cmbarbu
  • 4,354
  • 25
  • 45
  • 3
    I'm pretty sure that the greediness of `{` is only meaningful when you are using wild card forms (e.g. `{1,8}`). What may be happening here is that the presence of two greed modifiers in capture groups offsets the bug in the TRE library. – BrodieG Feb 25 '15 at 23:34
  • Nevermind that greediness is meaningless for a fixed width pattern like `{8}` (though that is an issue here), the key problem is the greediness of the leftmost pattern. Regular expressions are matched left to right, so the greediness of the right pattern should not affect the greediness of the left since that is matched first. Yet that is happening. The reason why PERL and TRE are different is because there is a bug in the TRE library. – BrodieG Mar 08 '15 at 15:20