1

Yesterday I asked a question latest Perl won't match certain regexes more than 32768 characters long to explain why

>./perl -e 'print (("a".("f"x32768)."a") =~ /a(?:[^a]|bb)*a/)'

does not work as expected (does not match) and the answer was, because of a backtracking depth limitation/bug of Perl regexes.

Why does the above match involve backtracking? The first a is matched, then, all the fs are matched by [^a]*, no need for the alternative bb, and then the last a is not [^a], neither bb, then the engine stops matching ([^a]|bb)* and then matches the last a and done. No backtracking needed.

The bug, occurs, when there is too much backtracking. OK, so the whole thing should work like this. The regex engine starts to match, proceeds to the end without backtracking, finishes with a match, no backtracking needed, the bug is not involved. Why doesn't it work that way?

Community
  • 1
  • 1
Mark Galeck
  • 6,155
  • 1
  • 28
  • 55
  • 1
    [Not sure if this helps](http://imgur.com/BdaoENO), but it seems the backtracking occurs at each alternation point. – Sam Oct 07 '14 at 17:26
  • There's a nice example of backtracking with grouped alternations in [`perlretut`](http://perldoc.perl.org/perlretut.html#Grouping-things-and-hierarchical-matching). The last `a` doesn't match `[^a]`, so you backtrack one character and try the second alternation, `bb`. – ThisSuitIsBlackNot Oct 07 '14 at 17:26
  • @ThisSuitIsBlackNot I _think_ I understand the concepts in your example, and still I don't see the above – Mark Galeck Oct 07 '14 at 17:36
  • @Sam yes it shows in your picture, but I don't understand it ! Where did you get that illustration from, so I can play with the source of it? – Mark Galeck Oct 07 '14 at 17:37
  • 1
    @MarkGaleck I used [Regex101](http://www.regex101.com/)'s online debugger. I didn't really understand it either, but I thought the illustration pointed out an interesting point. – Sam Oct 07 '14 at 17:40
  • @Sam thank you I am studying it now, I don't understand at all, but I am carefully looking, thank you thank you – Mark Galeck Oct 07 '14 at 17:43
  • Here is another interesting tidbit that may be related to the original issue of "recursion depth is limited to 32766": [every group in a regular expression takes a step to step into and out of the group](http://stackoverflow.com/questions/26093501/atomic-groups-clarity/26141971#26141971). – Sam Oct 07 '14 at 17:46
  • @ThisSuitIsBlackNot yes I understand why backtracking occurs at the last character, once. that is not the problem. The problem is, why does it occur at each previous character, 32768 times?? – Mark Galeck Oct 07 '14 at 19:07
  • @MarkGaleck You can use the debugger to see exactly what's going on: `perl -Mre=debug -e '"affffa" =~ /a(?:[^a]|bb){0,3}a/'` I *think* this should be analogous to `"a"."f"x32768."a" =~ /a(?:[^a]|bb)*a/` since Perl treats `*` as `{0,32767}` (except you won't get the warning). – ThisSuitIsBlackNot Oct 07 '14 at 19:40

1 Answers1

2

There's no backtracking for affffffa, but that doesn't mean the pattern can't backtrack. It'll require backtracking to find that affffff doesn't match. As such, the pattern needs to be ready to backtrack.

ikegami
  • 367,544
  • 15
  • 269
  • 518
  • Wait, I am confused, the pattern _should_ match, if it were not for the bug. The bug, occurs, when there is too much backtracking. OK, so the whole thing should work like this. The regex engine starts to match, proceeds to the end without backtracking, finishes with a match, no backtracking needed, the bug is not involved. Why doesn't it work that way? – Mark Galeck Oct 07 '14 at 17:41
  • Re "The bug, occurs, when there is too much backtracking.", uh, no. No backtracking occurs for `afff...fffa` – ikegami Oct 07 '14 at 17:43
  • well, it says `The regular expression engine uses recursion in complex situations where back-tracking is required. Recursion depth is limited to 32766` Well, backtracking should not be _required_ here – Mark Galeck Oct 07 '14 at 17:44
  • 1
    The pattern needs to be able to backtrack because it's needed for `affffff`. You can't know ahead of if backtracking will actually occur or not. – ikegami Oct 07 '14 at 17:46
  • I am sorry for being so dumb, but I don't understand, this is greedy quantifier, it does not care to find out if the partial string matches or not – Mark Galeck Oct 07 '14 at 17:47
  • 2
    @MarkGaleck: whatever the situation (presence of an "a" at the end or not), the regex engine must record each backtracking positions in prevision (if one or more backtracking steps would be needed). If the max number of recorded positions is reached, (because it's obviously limited), you obtain this error that means that the regex engine can't record more positions. – Casimir et Hippolyte Oct 07 '14 at 17:57
  • @CasimiretHippolyte let me get this straight - whenever there is an `*` (greedy quantifier) and it matches at least 32768 characters, there is a potential for this bug to show up. It is explained in the diag docs, that if the regex is "complex", recursion is used for backtracking. So, the bottom line, is that if `*` is used to match at least 32768 chars and the regex is "complex", the bug will show up. Is that what you are saying? – Mark Galeck Oct 07 '14 at 18:13
  • Whoever upvoted this answer, that is great, I would like to do so as well, but please explain to me why this is a good answer, because I don't understand it. There should be no need to worry about partial matches, we are using the greedy quantifier `*` here. – Mark Galeck Oct 07 '14 at 18:16
  • @MarkGaleck: it's not a bug, it's only a limit (and i really don't know if and how you can increase it if you decide to compile perl yourself). But indeed, if you use `.` in place of `(?:[^a]|bb)` the problem will not occur because the regex engine has nothing to record except a range (instead of a list of positions). I assume that the perl regex engine is smart enough to not record each positions when `.` is used. – Casimir et Hippolyte Oct 07 '14 at 18:38
  • @CasimiretHippolyte well, forget `.`, if I change `bb` to `b` the bug does not occur, presumably because, the regex is not as "complex" anymore. – Mark Galeck Oct 07 '14 at 19:05
  • Great, people keep upvoting and I don't understand the answer. Tough for me I guess. – Mark Galeck Oct 07 '14 at 19:06
  • 1
    Re "but please explain to me why this is a good answer, because I don't understand it", You asked why the patterns needs to be able to backtrack. I gave an example where backtracking happens. – ikegami Oct 07 '14 at 19:23
  • @ikegami I think I start to understand. I did not ask "why it needs to be able to backtrack". I asked why it _does_. I guess what you and Casimir are saying, is "it does not know whether it will have to backtrack or not", so it has to do some bookkeeping in order to enable backtracking. Wow, this is certainly not written anywhere in the Camel book, nor in the bug description on the diags site. – Mark Galeck Oct 07 '14 at 19:28
  • 1
    @MarkGaleck: in general, regex engines "pre-analysis" are poorly documented or are hidden anywhere. But you can be sure that you can't find this kind of precisions in a general documentation. – Casimir et Hippolyte Oct 07 '14 at 19:46