3

Note: this question is an outcome from another answer that as of now all its comments are removed.

In case of using a lookaround construct within a RegEx there is a backtrack or a kind of that takes place right before closing bracket. As I'm aware this backtrack comes to output of Perl and PCRE debuggers:

enter image description here

The question is what is this backtrack, why is it there and how is it interpreted as a backtrack?

Community
  • 1
  • 1
revo
  • 47,783
  • 14
  • 74
  • 117

1 Answers1

4

The backtrack is a lie.

It's just a consequence of how the regex101 debugger is implemented. It uses a PCRE feature (flag) called PCRE_AUTO_CALLOUT. This flag tells the PCRE engine to invoke a user-defined function at every step of matching. This function receives the current match status as input.

The catch is that PCRE doesn't tell the callout when it really backtracks. Regex101 has to infer that from the match status.

As you can see, in the step before the "backtrack" occurs, the current matched text is a_, and just after you get out of the lookahead, it's reverted to a. Regex101 notices the matched text is shorter and therefore it infers that a backtrack must have happened, with the confusing outcome you noticed.


For reference, here's the internal PCRE representation of the pattern with auto-callout enabled:

$ pcretest
PCRE version 8.38 2015-11-23

  re> /a(?=_)_b/DC
------------------------------------------------------------------
  0  59 Bra
  3     Callout 255 0 1
  9     a
 11     Callout 255 1 5
 17  17 Assert
 20     Callout 255 4 1
 26     _
 28     Callout 255 5 0
 34  17 Ket
 37     Callout 255 6 1
 43     _
 45     Callout 255 7 1
 51     b
 53     Callout 255 8 0
 59  59 Ket
 62     End
------------------------------------------------------------------
Capturing subpattern count = 0
Options:
First char = 'a'
Need char = 'b'

As you can see, there's no branching opcode there, just an Assert.

Lucas Trzesniewski
  • 50,214
  • 11
  • 107
  • 158
  • It's not only regex101 (PCRE and Python) debugger that shows it as a backtrack step but Perl debugger does it too. What about them? – revo Sep 06 '16 at 09:12
  • @revo Perl is a different engine, but it all boils down to *how* you define what a backtrack is. If you define it as "go back in the matched string", then yes, a backtrack occurs. If you define it as "try some other alternative", then what occurs is just an assertion ending. My guess is that the definition was guided by the implementation (it was probably just easier to go with the first definition). – Lucas Trzesniewski Sep 06 '16 at 09:16
  • I know that, that's why I point to it as a *kind of* backtrack in question. If it is call out that's the culprit then Perl should have a difference that has not. – revo Sep 06 '16 at 09:20
  • @revo: *"Perl debugger does it too"* I mentioned this before: I don't see any reference to "backtracking" in the Perl re debug output. What makes you think it's there? – Borodin Sep 06 '16 at 09:35
  • Sorry but you didn't mention that before. Re-check `Regexp::Debugger` output for a clear representation of what is going on. @Borodin – revo Sep 06 '16 at 09:40
  • @revo: Ah I see. It says *Back-tracked within regex and reend of positive lookahead*. I don't understand *reend* but I see your point. – Borodin Sep 06 '16 at 10:01
  • I don't get it either. @Borodin – revo Sep 06 '16 at 10:03
  • 2
    @revo: I raised this with the module's author, Damian Conway, who replied with this *"It's a bug in the automatic description generation. I've just patched it and uploaded a new release that changes the message to* ***Resetting match position after end of positive lookahead*** *which is much more descriptive"* – Borodin Sep 07 '16 at 09:00
  • @Borodin nice, that description makes sense – Lucas Trzesniewski Sep 07 '16 at 09:05
  • This is now *entirely* something different. Did you send an email? I appreciate your effort on this. @Borodin – revo Sep 07 '16 at 11:46
  • In case of a lookbehind it shows such description: *Back-tracked within regex and retrying a literal '_' character* What do you think about this? @Borodin – revo Sep 07 '16 at 12:07
  • @revo: What is your exact regex and target string? I think that may be another error that the author may want to fix. I'll get in touch again. – Borodin Sep 07 '16 at 12:23
  • It doesn't differ, in my case it is `/a_(?<=_)b/` @Borodin – revo Sep 07 '16 at 12:31
  • @revo: Well that's different from the forward look-ahead: it clearly has to be. I just wanted a definitive expression to discuss, and I guess that's `'a_b' =~ /a_(?<=_)b/`. Thanks. – Borodin Sep 07 '16 at 20:54
  • Sorry I didn't get your comment. @Borodin – revo Sep 07 '16 at 20:58
  • I'm questioning *"It doesn't differ"* as we have never discussed the look-behind construct at all. – Borodin Sep 07 '16 at 21:12
  • In question I'm asking about lookarounds (both lookaheads and lookbhinds). You asked about exact regex that I'm pointing to, and I said it doesn't differ. Whole regex could be something different. It only needs a lookbehind inside it for `Regexp::Debugger` to produce a similar description. Am I clear? @Borodin – revo Sep 07 '16 at 22:08
  • @revo: Okay, I think you mean "it doesn't matter". "It doesn't differ" means "it isn't different". – Borodin Sep 07 '16 at 22:11
  • Yea, I could say that but I think we were just thinking about a different subject. Nevermind. @Borodin – revo Sep 07 '16 at 22:22