RegEx debugging

Question

I'm debugging a Regular Expression ^(A+)*B over a string AAAC (example from rexegg.com) by two separate debugging tools I have access:

regex101.com
RegexBuddy v4

Below is the results (regex101 in the left side):

The question I have is not mainly about the number of steps which is also important to me, but is how backtracks are done. Why do we see differences? (regex101 uses PCRE lib and I set RegexBuddy lib the same)

A comprehensive step by step explanation is really in my favor.

++ for using `^{` tag in your question .. `:-)` I like when a person cares about the design.} — Shafizadeh, Jun 19 '16 at 21:58
@Shafizadeh Still, that’s neither semantically correct, nor is this a sufficient reason to upvote the question, because votes should rather be based on content. — Sebastian Simon, Jun 19 '16 at 22:09
@Xufox You are right .. I just did that once, I won't do that again. — Shafizadeh, Jun 19 '16 at 22:10
`(A+)*` is a backtracking nightmare waiting to happen. Use `(A+)?` instead. — , Jun 22 '16 at 21:37
This `(A+)*` on failure repeats the permutations of `A+` (1..N) with the permutations of `(...)*` (0..N). It becomes exponential. It's ok to wrap chunks with `(..)*` as long as it's contents are constrained locally. A better way is to benchmark suspects using extreme conditions of failure. The problem will show up pretty quick. I'll post some benchmarks. — , Jun 22 '16 at 22:05
Have you written an [evil regex](http://www.regular-expressions.info/catastrophic.html) on purpose? — JDB, Jun 22 '16 at 22:39
@JDB Did you notice that I said example is from rexegg.com? and actually *it is* on purpose. — revo, Jun 22 '16 at 22:45
I don't make a habit of following unknown links from SO posts. Just checking to make sure you knew it was evil (I figured you did). — JDB, Jun 22 '16 at 23:00
@JDB By the way rexegg.com is a famous site around Regular Expressions. I wonder why you used word *unknown* for it. — revo, Jun 25 '16 at 09:57
AFAIK, RegexBuddy uses the JGSoft regex engine, which was built to mimic other regex engines for debugging purposes, so it doesn't *really* use PCRE like regex101 does. As for regex101, the debugger is implemented using the [`PCRE_AUTO_CALLOUT` feature](http://www.pcre.org/original/doc/html/pcrecallout.html) - this lets regex101 inspect the current match status at each position in the pattern, and infer when backtracking occurs from there. So, since the engines' designs are not the same, it's not *that* surprising you get somewhat different display of the same result. — Lucas Trzesniewski, Jun 26 '16 at 16:25
@LucasTrzesniewski I encourage you to gather your comment in an answer as it contains important points to the question. — revo, Jun 27 '16 at 16:18

score 16 · Answer 1 · edited May 23 '17 at 12:00

TL;DR

In short, "backtracking" is when a regex engine returns to a "flexible" match, attempting a different path to get a successful match.

Backtracking with Alternation

For example, in the following pattern and input:

foo(bar|baz)

foobaz

The regex engine will match "foo", then attempt the first of the two options, matching "b" and then "a", but fails at "r". Rather than failing the whole match, though, it will "rewind the tape" and start with the second alternative, matching "b" then "a" and then "z"... success!

Backtracking with Quantifiers

This also works with quantifiers. A quantifier is anything that encourages the engine to match a repeating pattern, including ?, *, + and {n,m} (depending on the engine).

A greedy quantifier (the default) will match as many repetitions as possible before moving on to the rest of the pattern. For example, given the pattern and input below:

\w+bar

foobar

The pattern \w+ will begin by matching the entire string: "foobar". However, when it moves on to the b, the regex engine has reach the end of the input and the match fails. Rather than simply giving up, the engine will ask the last greedy quantifier to give up one of its repetitions, now matching "fooba". The match still fails, so the engine asks \w+ to give up the "a" (failure), and then the "b". After giving up the "b", the engine can now match bar, and the match succeeds.

Trees and Backtracking

Another way of thinking of a regex is as a "tree", and backtracking is going back up a node and trying another path. Given the pattern foo(bar|baz) and the input "foobaz", the engine will attempt something like the following:

foo(bar|baz)
|\
| \
|  : f (match)
|  : o (match)
|  : o (match)
|  | (bar|baz)
|  |\
|  | \
|  |  : b (match)
|  |  : a (match)
|  |  : r (FAIL: go back up a level)
|  |  ^
|  |\
|  | \
|  |  : b (match)
|  |  : a (match)
|  |  : z (match)
|  | /
|  |/
| /
|/
SUCCESS

Counting the "Backtracks"

As to why you see differences in the "number" of backtracks... this probably has a lot to do with internal optimizations and logging level. For example, RegexBuddy does not appear to be logging the match to the empty string before ^, while regex101 does. In the end, though, it doesn't really matter what exact order you backtrack in (what order you climb back up the tree) so long as you end up with the same result.

Evil Regexes

You already know this, but for the benefit of anyone else who happens by, your regex was written to demonstrate "catastrophic backtracking" (aka "evil regex"), where the number of backtrack attempts grows exponentially as the length of the input increases. These regexes can be exploited to perform DoS attacks, so you must use caution not to introduce these into your patterns (as I found out).

score 8 · Accepted Answer · 2016-06-25T01:07:59.300

I wouldn't rely on either the number of steps or any debugging as a test
of either backtracking or failure.

Generally, keep away from simple constructs such as (A+)* that not only
backtrack the inner A+ but backtrack the outter (..)* as well.
Each pass of the outter produces a fresh (new) inner series of backtracking.

Get a good benchmark software like RegexFormat
and test a series for an exponential time result.
A linear result is what you are looking for as the content increases by the same
amount.

Below is an example of (A+)? vs (A+)*. We set the target to a known failure
and steadily increase the length to extend the processing of that failure.

If you look at regex 2 (A+)* you can notice the exponential increase in just
three target increases.
Finally, we blow out the target to overload the internal resources of the engine.

Edit: After some analysis, I post a meager conclusion on backtracking steps.
By analysis of the benchmark time's below, backtracking appears to be a 2^N process.
Where N is the number of character literals that are backtracked on failure.

Given Revo's simple case, it's a bit easier to isolate the backtracking.
Because it looks like %98 of the total time taken involves just backtracking.

Given that assumption, one can take the time results from 2 points, and generate an equation.

The form of the equation I used was f(x) = a * r^x.
Given coordinates (# of 'A's vs. Time taken), using points
(7, 1.13) , (9, 4.43) which generated this f(x) = 0.009475 * 1.9799 ^ x
which is really this # sec's = 0.009475 * 1.9799 ^ # letters
I ran all the number of letter 'A's from the benchmark's below into this formula.

#LTR's   Bench Time
 7         1.13
 9         4.43
13        70.51

which produced the exact benchmark time ( +/- .1% ).

I then saw that 1.9799 is pretty close to 2.0, so I adjusted the 'a' factor down to .009 and ran it again, getting this:

f(7 letters) = 2 ^ 7 * .009 =  1.152 sec’s
f(9 letters) = 2 ^ 9 * .009 =  4.608 sec’s
f(13 letters) = 2 ^ 13 * .009 =  73.728 sec’s
f(27 letters) = 2 ^ 27 * .009 =  1207959.552 sec’s = 335 hours

It seems pretty clear now that backtracking steps correlate to the number of
letters to backtrack in a 2 ^ N relationship.

I also think it's a fair bet that some engines can do this simple 2^N math
only in simple scenario's like this one. These seem to be the times where
the engine comes back immediately and says Too complex!.
The translation here is that the regex is simple enough that the engine can
detect it. Other times, the engine never comes back,
it's hung, and may come back in a year or two (or ten..).

Conclusion therefore is not if the engine will backtrack, it will, but how
will your target string affect the backtracking.

So, vigilance is required when writing regex, and must be on guard against
nested open ended quantifiers.

I'd say the best bet is always to hit a regex real hard to get it to fail.
And I'm talking about excessive repetitive literals in suspect places.
end edit

Benchmark App

Target: AAAAAAAC

Benchmark

Regex1:   ^(A+)?B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    0.07 s,   72.04 ms,   72040 µs


Regex2:   ^(A+)*B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    1.13 s,   1126.44 ms,   1126444 µs

Target: AAAAAAAAAC

Benchmark

Regex1:   ^(A+)?B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    0.08 s,   82.43 ms,   82426 µs


Regex2:   ^(A+)*B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    4.43 s,   4433.19 ms,   4433188 µs

Target: AAAAAAAAAAAAAC

Benchmark

Regex1:   ^(A+)?B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    0.10 s,   104.02 ms,   104023 µs


Regex2:   ^(A+)*B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    70.51 s,   70509.32 ms,   70509321 µs

Target: AAAAAAAAAAAAAAAAAAAAAAAAAAAC

Benchmark

Regex1:   ^(A+)?B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    0.18 s,   184.05 ms,   184047 µs


Regex2:   ^(A+)*B
Options:  < none >
Error:  Regex Runtime Error !!   
Completed iterations:   0  /  50     ( x 1000 )
Matches found per iteration:   -1
Elapsed Time:    0.01 s,   7.38 ms,   7384 µs


Error item 2 :  Target Operation .. 

The complexity of matching the regular expression exceeded predefined
bounds. Try refactoring the regular expression to make each choice made by
the state machine unambiguous. This exception is thrown to prevent
"eternal" matches that take an indefinite period time to locate.

That is a perfect answer one should provide around this topic. But I have some questions here: 1- You didn't talk about *why* do we see differences in steps both debuggers took. Why really? 2- In the third benchmark elapsed time is 70.51s (seconds) for second regex. Is it right?! 3- If you don't mind, please explain third line of each test. — revo, Jun 22 '16 at 22:43
@revo - 1. I don't know how they do their debuggers. It's probably a trivial difference. But fwiw, I added an edit with a _2^N_ backtrack theory. 2. Yes, 70 sec's is correct (see added edit as well). 3. The third line is Iterations completed / Iterations requested (both times 1000). The input box takes a number from 1-9999 amount of iterations requested. There is a check box to use a multiplier of 1000 or not. So basically it can do from 1 - 9,999,000 iterations. Get the app, try it for yourself. [Benchmark](http://www.regexformat.com/scrn7/s7_b4.jpg) — , Jun 25 '16 at 01:02

RegEx debugging

2 Answers2

TL;DR

Backtracking with Alternation

Backtracking with Quantifiers

Trees and Backtracking

Counting the "Backtracks"

Evil Regexes

Linked