Case of poor Python 're' performance

Question

A Sublime package I'm using for autocompletion in lua uses Python's 're' module, but one regex is causing major slowdowns. Here's a minimal example:

import re
rx = re.compile(r"\bfunction(?:\s+[a-zA-Z0-9._]*)?\(((?:[a-zA-Z_][a-zA-Z0-9_]*|\.\.\.|,\s*)*)\)")
rx.match('function f(aaaaaaa, bbbbbbbb, cccccccc, ddddddd eeeeee)')  # Very slow
rx.match('function f(aaaaaaa, bbbbbbbb, cccccccc, ddddddd, eeeeee)')  # Adding a comma between the last function arguments in the string fixes it.

I'm out of my depth with regex debugging, but this seems relevant, though it's over my head.

Does anyone know an equivalent pattern I can use, but which has good performance?

Thank you!

Maybe [submit a bug report](https://github.com/ColonelThirtyTwo/LuaAutocomplete/issues)? — Martin Tournoij, Mar 09 '16 at 11:10
Relevant article: http://davidvgalbraith.com/how-i-fixed-atom/ about the same catastrophic backtracking problem. — BioGeek, Mar 09 '16 at 11:17
That's because of a [*catastrophic backtracking*](http://www.regular-expressions.info/catastrophic.html) which has happened in your regex matching. Actually your regex engine wants to tries all the possibilities until it encounter a comma. — Mazdak, Mar 09 '16 at 11:18
More info about catastrophic backtracking: http://www.regular-expressions.info/catastrophic.html — BioGeek, Mar 09 '16 at 11:18

score 2 · Answer 1 · answered Mar 09 '16 at 11:38

The problem is that the pattern contains a *-quantified * quantifier: (?:[a-zA-Z_][a-zA-Z0-9_]*|\.\.\.|,\s*)*. These nested quantifiers must be avoided. If you cannot re-arrange the alternatives into subsequent (optional) groups, possessive quantifiers or atomic groups usually help.

Python re does not support possessive quantifiers, nor does it support atomic groups.

However, you can use a version with an emulated atomic group:

\bfunction(?:\s+[a-zA-Z0-9._]*)?\(((?:(?=([a-zA-Z_][a-zA-Z0-9_]*))\2|\.\.\.|,\s*)*)\)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The [a-zA-Z_][a-zA-Z0-9_]* subpattern is placed into a capturing group that is enclosed with a positive lookbehind. The (?=([a-zA-Z_][a-zA-Z0-9_]*))\2 is almost equivalent (as it matches the same stuff) to [a-zA-Z_][a-zA-Z0-9_]* but it makes the subpattern atomic (so, in other flavors it would look like [a-zA-Z_][a-zA-Z0-9_]*+ or (?>[a-zA-Z_][a-zA-Z0-9_]*+)). The "atomic" nature of the \2 backreference won't let backtracking go beyond [a-zA-Z_][a-zA-Z0-9_]*.

See the regex demo.

Also, perhaps, you want re.search since your pattern starts with \b. Note that re.match can only find a match at the beginning of the string (= it is anchored at the beginning of the string).

Casimir et Hippolyte · Answer 2 · 2016-03-09T12:09:20.940

The main problem comes from the nested quantifiers (?:[a-zA-Z_][a-zA-Z0-9_]*|\.\.\.|,\s*)* that may cause a catastrophic backtracking when the following subpattern fails.

You can avoid this problem if you unroll^* this subpattern. I used the Verbose mode to make it more readable:

re.compile(r'''\bfunction\b \s*[a-zA-Z0-9._]*
\(

(
  (?:
    (?:[a-zA-Z_][a-zA-Z0-9_]*|\.\.\.)
    (?: ,\s* (?:[a-zA-Z_][a-zA-Z0-9_]*|\.\.\.) )*
  )?
)

\)''', re.VERBOSE)

demo

^(*): basically the idea is to change (a*|b)* to a*(ba*)*

Case of poor Python 're' performance

2 Answers2