I am trying to study regex from multiple sources, but came across a confusion about backtracking, because one defines backtracking, that it means the state when a regex engine fail to match a pattern, so it backtracks to the position where the first atom was matched, so for example,to match cat
in He captured a catfish for his cat
the engine will step like this:
- It will search for
c
until it match atc
incaptured
- then the same for
a
- but fail to match
t
withp
- the engine reset back to position after
c
incaptured
, because it knew that no match will happen before this point.
So in all cases with no quantifiers, the engine will be reset the whole pattern to try match it again from different position.
Another defines backtracking as the state of using a quantifier like .*
so the regex engine will match the complete text which makes it fail so it backtracks one by by one until match occurs. i think these are not the same.
As stated here:
A fundamental feature of regular expression matching involves the notion called backtracking. which is used (when needed) by all regular expression quantifiers, namely *, *?, +, +?, {n,m}, and {n,m}?.
For a regular expression to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part-that's why it's called backtracking.
Which means a pattern like ([A-Z][A-Z0-9]*)\b[^>]*>.*<\/\1>
to match Testing <B><I>bold italic</I></B> text.
will work like this:
- It will first match the
<B>
- Then the pattern need to match the
.*
which will match up to the end of the string. - then it will try to match the
<
but it has already reached the end so it will backtrack one character by one until it matches.
In contrast to the first cat
example which completely resets the engine to the very first atom and start again from the position matched the first atom.
But in another situation if i add ?
after .*
, the regex skips this atom .*?
trying to match the remaining characters, but if no match it falls back to the .
to match until there is a <
and then start matching atoms after it.
I think there are multiple definitions here, would any one explain which on of these cases is backtracking.