9

The following Python code is incredibly slow:

import re
re.match( '([a]+)+c', 'a' * 30 + 'b' )

and it gets worse if you replace 30 with a larger constant.

I suspect that the parsing ambiguity due to the consecutive + is the culprit, but I'm not very expert in regexp parsing and matching. Is this a bug of the Python regexp engine, or any reasonable implementation will do the same?

I'm not a Perl expert, but the following returns quite fast

perl -e '$s="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"; print "ok\n" if $s =~ m/([a]+)+c/;'

and increasing the number of 'a' does not alter substantially the execution speed.

icedwater
  • 4,701
  • 3
  • 35
  • 50
  • 9
    have a look at [catastrophic backtracking](http://www.regular-expressions.info/catastrophic.html). if the engine doesn't optimise away the two `+`s this can get really bad, especially since there is capturing group between the two repetitions, and capturing is comparably expensive. – Martin Ender Nov 01 '12 at 14:25
  • @m.buettner: Yes, and likely Perl's engine is smart enough to filter that out. Make it an answer! – Bergi Nov 01 '12 at 14:29
  • Here's how to benchmark it in Perl: `perl -E 'use Benchmark ":hireswallclock"; $s="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"; $t0 = Benchmark->new; say "ok" if $s =~ m/([a]+)+c/; $t1 = Benchmark->new; say timestr(timediff($t0, $t1))'` – simbabque Nov 01 '12 at 14:29
  • Here's a 5 second timeout from idoene: http://ideone.com/QwOjSE – Kobi Nov 01 '12 at 14:29
  • possible duplicate of [Python Regular Expression: why does this not work?](http://stackoverflow.com/questions/12014991/python-regular-expression-why-does-this-not-work) – Kobi Nov 01 '12 at 14:32
  • 2
    Ah, [there's a bug](http://bugs.python.org/issue13723). – SilentGhost Nov 01 '12 at 14:33
  • @mgilson you can time the execution of both code snippets with any tool of your choice. Does two orders of magnitude fit your definition of "incredibly slow"? It sure does for mine… –  Nov 01 '12 at 14:40
  • @mgilson: It takes `5` *microseconds* on my machine: `python -mtimeit -s'import re2 as re; s="a"*30+"b"; m=re.compile("([a]+)+c").match ' 'm(s)'` and more than `10` *minutes* (quit waiting) with ordinary `re` module. `O(2**n)` is *very* slow. – jfs Nov 02 '12 at 14:44
  • @J.F.Sebastian -- Yeah. I'll delete that comment. It's sort of a standard reaction I have when somebody says a piece of code is "slow" but provides absolutely no evidence to support that assertion. – mgilson Nov 02 '12 at 14:47

2 Answers2

13

I assume that Perl is clever enough to collapse the two +s into one, while Python is not. Now let's imagine what the engine does, if this is not optimized away. And remember that capturing is generally expensive. Note also, that both +s are greedy, so the engine will try to use as many repetitions as possible in one backtracking step. Each bullet point represents one backtracking step:

  • The engine uses as many [a] as possible, and consumes all thirty as. Then it can not go any further, so it leaves the first repetition and captures 30 as. Now the next repetition is on and it tries to consume some more with another ([a]+) but that doesn't work of course. And then the c fails to match b.
  • Backtrack! Throw away the last a consumed by the inner repetition. After this we leave the inner repetition again, so the engine will capture 29 as. Then the other + kicks in, the inner repetition is tried out again (consuming the 30th a). Then we leave the inner repetition once again, which also leaves the capturing group, so the first capture is thrown away and the engine captures the last a. c fails to match b.
  • Backtrack! Throw away another a inside. We capture 28 as. The second (outer repetition) of the capturing group consumes the last 2 as which are captured. c fails to match b.
  • Backtrack! Now we can backtrack in the second other repetition and throw away the second of two as. The one that is left will be captured. Then we enter the capturing group for the third time and capture the last a. c fails to match b.
  • Backtrack! Down to 27 as in the first repetition.

Here is a simple visualization. Each line represents one backtracking step, and each set of parentheses shows one consumption of the inner repetition. The curly brackets represent those that are newly captured for that step of backtracking, while normal parentheses are not revisited in this particular backtracking step. And I leave out the b/c because it will never be matched:

{aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
{aaaaaaaaaaaaaaaaaaaaaaaaaaaaa}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaaaa}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaaa){a}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaaa}{aaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa){aa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa){a}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a){a}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaa}{aaaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){aaa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){aa}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa){a}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){a}{aaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a){aa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a){a}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a)(a){a}{a}

And. so. on.

Note that in the end the engine will also try all combinations for subsets of a (backtracking just through the first 29 as then through the first 28 as) just to discover, that c does also not match a.

The explanation of regex engine internals is based on information scattered around regular-expressions.info.

To solve this. Simply remove one of the +s. Either r'a+c' or if you do want to capture the amount of as use r'(a+)s'.

Finally, to answer your question. I would not consider this a bug in Python's regex engine, but only (if anything) a lack in optimization logic. This problem is not generally solvable, so it is not too unreasonably for an engine to assume, that you have to take care of catastrophic backtracking yourself. If Perl is clever enough to recognize sufficiently simple cases of it, so much the better.

Martin Ender
  • 43,427
  • 11
  • 90
  • 130
  • 1
    @Axeman The point is that in the given case it cannot find a match of the full pattern (because `c` never matches `b`). If the engine does not contain any optimization for theses cases, it will backtrack through the repetitions anyway (because they might contain more complex expressions, which could lead to a match in the end). – Martin Ender Nov 01 '12 at 14:58
4

Rewrite your regular expression to eliminate the "catastrophic backtracking", by removing the nested quantifiers (see this question):

re.match( '([a]+)+c', 'a' * 30 + 'b' )
# becomes
re.match( 'a+c', 'a' * 30 + 'b' )
Community
  • 1
  • 1
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • My regex is just an example, of corse there is an equivalent one that does not incur into the backtracking problem. The point was not how to obtain an efficent match, but if the inefficency was due to a bug, a suboptimal implementation or an intrinsic problem. –  Nov 01 '12 at 18:43
  • 1
    @Mapio In general the advice to fix is: *rewrite your regex*. (You're right, this was easy in your example and will be more difficult in general!) – Andy Hayden Nov 01 '12 at 19:47