11

I'm not sure I completely understand what is going on with the following regular expression search:

>>> import re
>>> template = re.compile("(\w+)+\.")
>>> target = "a" * 30
>>> template.search(target)

search() call takes minutes to complete, CPU usage goes to 100%. The behavior is reproduceable for both 2.7.5 and 3.3.3 Python versions.

Interesting fact that if the string is less than 20-25 characters in length, search() returns like in no time.

What is happening?

P̲̳x͓L̳
  • 3,615
  • 3
  • 29
  • 37
alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
  • 3
    possible duplicate of [Why can regular expressions have an exponential running time?](http://stackoverflow.com/questions/8887724/why-can-regular-expressions-have-an-exponential-running-time) – Tim Peters Mar 26 '14 at 02:05
  • 1
    A case of [ReDoS](http://en.wikipedia.org/wiki/ReDoS)? – miku Mar 26 '14 at 02:05
  • @paxdiablo it's actually a synthetic simplified example, that I've accidentally reproduced. Haven't understood it right away and decided to ask for help, thanks. – alecxe Mar 26 '14 at 02:11
  • @miku that's the answer, TIL, thx~ ;) – zhangxaochen Mar 26 '14 at 02:28
  • As it's looking for 1 or more alpahnum chars, the only way it can terminate is to find a non alphanum char. Then you are asking it to repeat that loop outside of the capture group with the second +. So if it can't find a char that terminates it early, such as "." then it keeps looping. It should however know to stop when it hits the end of the string. This is pure conjecture. If you can somehow see the generated fsa, that would indicate the true answer. – Simon Mar 26 '14 at 02:30
  • 1
    Explained in length by Friedl here: http://www.foo.be/docs/tpj/issues/vol1_2/tpj0102-0006.html – Robin Mar 31 '14 at 11:26
  • 1
    One great resource is here: http://swtch.com/~rsc/regexp/regexp1.html – Patrick Collins Apr 02 '14 at 23:31

3 Answers3

12

Understanding this problem requires understanding how NFA works under RegExp.

Elaborating the definition of NFA may be a mission too heavy for me. Search NFA on wiki it will gives you a better explanation. Here just think NFA is a robot finding patterns you give.

Crudely implemented NFA is somewhat dumb, it just looks ahead one or two tokens you give. So in the synthetic example you give, NFA just looks \w+ at first (not parenthesis for grouping).

Because + is a greedy quantifier, that is, matches as many characters as possible, so NFA dumbly continues to consume characters in target. After 30 as, NFA encounters the end of string. After then does NFA realize that he needs to match other tokens in template. The next token is +. NFA has matched it so it proceeds to \.. This time it fails.

What NFA does next is to step one character back, trying to match the pattern by truncating the submatching of \w+. So NFA split the target in to two groups, 29 as for one \w+, and one trailing a. NFA first tries to consume the trailing a by matching it against the second +, but it still fails when NFA meeting \.. NFA continues the process above until it gets a full match, otherwise it will tries all possible partitions.

So (\w+)+\. instructs NFA to group target in such manner: target is partitioned into one or more groups, at least one character per group, and target is end with a period '.'. As long as the period is not matched. NFA tries all partitions possible. So how many partitions are there? 2^n, the exponential of 2. (JUst think inserting separator between a). Like below

aaaaaaa a
aaaaaa aa
aaaaaa a a
.....
.......
aa a a ... a
a a a a a .... a

If NFA matches \., it won't hurt much. But when it fails to match, this expression is doomed to be never-ending exponential .

I'm not advertising but Mastering Regular Expression is a good book to understand mechanism under RegExp.

Herrington Darkholme
  • 5,979
  • 1
  • 27
  • 43
  • BTW, RegExp engines have plenty of optimizations, they are smarter than what I describe above. So exponential nightmare is less likely to happen in an advanced RegExp engine. – Herrington Darkholme Mar 26 '14 at 02:57
6

The slowness is caused by backtracking of the engine:

(\w+)+\.

Backtracking will naturally occur with this pattern if there's no . at the end of your string. The engine will first attempt to match as many \w as possible and backtracks when it finds out that more characters need to be matched before the end of your string.

(a x 59) .
(a x 58) .
...
(a) .

Finally it will fail to match. However, the second + in your pattern causes the engine to inspect (n-1)! possible paths, so:

(a x 58) (a) .
(a x 57) (a) (a) .
(a x 57) (a x 2) .
...
(a) (a) (a) (a) (a) (a) (a) ...

Removing the + will prevent an abnormal amount of backtracking:

(\w+)\.

Some implementations will also support possessive quantifiers, which might be more ideal in this particular scenario:

(\w++)\.
Ja͢ck
  • 170,779
  • 38
  • 263
  • 309
1

The second plus is causing issues:

template = re.compile("(\w+)\.")

works fine for me. To see the parse tree for the regex, pass in re.DEBUG as the 2nd argument to compile:

import re

re.compile("(\w+)+\.", re.DEBUG)
print "\n\n"
re.compile("(\w+)\.", re.DEBUG)


max_repeat 1 65535
  subpattern 1
    max_repeat 1 65535
      in
        category category_word
literal 46


subpattern 1
  max_repeat 1 65535
    in
      category category_word
literal 46

Process finished with exit code 0

That proves that the second plus is adding a loop, which the python regex parser must cap at 65535. That somewhat proves my theory.

Note that to run that you will want a fresh python interpreter for each execution. re.compile memoizes the values passed in, so it will no re-compile the same regex twice, repeatedly running that in ipython for instance does not print out the parse tree after the first time you run it.

Simon
  • 2,840
  • 2
  • 18
  • 26