1

I am trying to study regex from multiple sources, but came across a confusion about backtracking, because one defines backtracking, that it means the state when a regex engine fail to match a pattern, so it backtracks to the position where the first atom was matched, so for example,to match cat in He captured a catfish for his cat the engine will step like this:

  • It will search for c until it match at c in captured
  • then the same for a
  • but fail to match t with p
  • the engine reset back to position after c in captured, because it knew that no match will happen before this point.

So in all cases with no quantifiers, the engine will be reset the whole pattern to try match it again from different position.

Another defines backtracking as the state of using a quantifier like .* so the regex engine will match the complete text which makes it fail so it backtracks one by by one until match occurs. i think these are not the same.

As stated here:

A fundamental feature of regular expression matching involves the notion called backtracking. which is used (when needed) by all regular expression quantifiers, namely *, *?, +, +?, {n,m}, and {n,m}?.

For a regular expression to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part-that's why it's called backtracking.

Which means a pattern like ([A-Z][A-Z0-9]*)\b[^>]*>.*<\/\1> to match Testing <B><I>bold italic</I></B> text. will work like this:

  • It will first match the <B>
  • Then the pattern need to match the .* which will match up to the end of the string.
  • then it will try to match the < but it has already reached the end so it will backtrack one character by one until it matches.

In contrast to the first cat example which completely resets the engine to the very first atom and start again from the position matched the first atom.

But in another situation if i add ? after .*, the regex skips this atom .*? trying to match the remaining characters, but if no match it falls back to the . to match until there is a < and then start matching atoms after it.

I think there are multiple definitions here, would any one explain which on of these cases is backtracking.

Community
  • 1
  • 1
Makiomar
  • 106
  • 1
  • 9
  • Actually yes, they are the same. Consider `ca+?t` (which uses a quantifier) - it backtracks exactly the same as `cat` on your sample input – Bergi Feb 06 '19 at 18:41
  • @WiktorStribiżew i think you are right, so i have made some edits to the question, which i need to reconsider and add more details about each case, with your comment as an answer. – Makiomar Feb 07 '19 at 06:27
  • I have read some more on the Web, and it seems that the main point about *backtracking* is that *"regular expression engine returns to a previous saved state to continue its search for a match"*. – Wiktor Stribiżew Feb 07 '19 at 10:16
  • It may happen when there are non-fixed length quantifiers and alternation constructs. Just the way of *matching* is a bit different with greedy (grabs all it can and then yields char by char from the end of the grabbed text to accommodate for the next pattern) and non-greedy (the non-greedy pattern is skipped first, all subsequent ones are tested and then the non-greedy pattern is "expanded", grabs one more char, and the subsequent patterns are tried again) quantifiers. – Wiktor Stribiżew Feb 07 '19 at 10:16

1 Answers1

1

Let's check some backtracking definitions.

"The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options, it selects one and remembers the other to return to later if need be. If the attempted option is successful and the rest of the regex is also successful, you are finished with the match. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the option and can continue with the match by trying another. This way, it eventually tries all possible permutations of the regex (or at least as many as needed until a match is found)." Mastering Regular Expressions Powerful Techniques for Perl and Other Tools, Jeffrey E. F. Friedl, p.102

Another one:

"When a regular expression includes optional quantifiers or alternation constructs, the evaluation of the input string is no longer linear. Pattern matching with an NFA engine is driven by the language elements in the regular expression and not by the characters to be matched in the input string. Therefore, the regular expression engine tries to fully match optional or alternative subexpressions. When it advances to the next language element in the subexpression and the match is unsuccessful, the regular expression engine can abandon a portion of its successful match and return to an earlier saved state in the interest of matching the regular expression as a whole with the input string. This process of returning to a previous saved state to find a match is known as backtracking." (Backtracking in Regular Expressions, Microsoft docs)

Yet another one:

"For a regular expression to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part-that's why it's called backtracking." (source)

It appears that the central part of backtracking is the "return" process to an earlier state to re-evaluate (re-match) the string to make the entire expression match. It is not restricted as to how this should be done to call the process as such.

None of the sources limits backtracking to only greedy or non-greedy quantifiers. You may read about the difference in greedy and non-greedy pattern behavior in my former answer, Can I improve performance of this regular expression further. In brief, they differ in the mechanism, the way of getting back and re-matching, but the essence is the same, as described in the definitions above.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    Stribizew Thank you, it now understand and also i supported my understanding with reading a little bit about finite state machine which gave me a more clear vision of how regex engine works. – Makiomar Feb 27 '19 at 09:44