51

I need to match certain URLs in web application, i.e. /123,456,789, and wrote this regex to match the pattern:

r'(\d+(,)?)+/$'

I noticed that it does not seem to evaluate, even after several minutes when testing the pattern:

re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')

The expected result would be that there were no matches.

This expression, however, executes almost immediately (note the trailing slash):

re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')

Is this a bug?

Mazdak
  • 105,000
  • 18
  • 159
  • 188
abhillman
  • 3,942
  • 1
  • 20
  • 21
  • 7
    [Here's](http://swtch.com/~rsc/regexp/regexp1.html) a great article that can help you understand the situation. – TML Sep 22 '14 at 20:30
  • 6
    Related: [Very slow regular expression search](http://stackoverflow.com/questions/22650098/very-slow-regular-expression-search) and [Why can regular expressions have an exponential running time?](http://stackoverflow.com/questions/8887724/why-can-regular-expressions-have-an-exponential-running-time). – alecxe Sep 22 '14 at 21:22
  • 3
    Notably, there are regex implementations available without these pathological cases -- see for instance https://pypi.python.org/pypi/re2/ – Charles Duffy Sep 22 '14 at 22:07
  • 5
    [`regex`](https://pypi.python.org/pypi/regex) is going to *eventually* replace `re`, and is not only much faster in this case but also filled with goodies and improvements in a billion different ways. – Veedrac Sep 23 '14 at 03:08
  • 1
    For performance I would suggest: `\d(?:\d|,\d)*+/` fast and simple and matches only correct strings – Falco Sep 23 '14 at 08:06
  • [Here's an interactive example](http://regex101.com/r/sM4wM1/1) (from regex101.com). What's interesting is that the timeout occurs when using the JavaScript regex flavor, but not the Python or PHP ones. – Ajedi32 Sep 23 '14 at 14:30
  • 1
    Just for the curious reader: the same thing is used as a Denial of Service attack. Read about it in OWASP web site: https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS – Iravanchi Sep 24 '14 at 08:43

3 Answers3

57

There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-match string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match with attempting all of the extraneous repetition). This is solved by optimizing the expression.


The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string: [\d,]+/$. However, that is not perfect since it would allow for something like ,123,,4,5/.

For this you can use a slightly optimized version of your initial try: (?:\d,?)+/$. First, I made your repeating group non-capturing ((?:...)) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the \d inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional , since ? only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing /.


This can still match an odd string 1,2,3,/, so for the heck of it I improved your original regex with a negative lookbehind: (?:\d,?)+(?<!,)/$. This will assert that there is no comma directly before the trailing /.

Sam
  • 20,096
  • 2
  • 45
  • 71
  • 2
    To answer the second part of OP's question: yes it's a bug (in Python). – R.. GitHub STOP HELPING ICE Sep 23 '14 at 03:43
  • 28
    @R.. I don't think it's fair to call a badly behaving algorithm a 'bug'. If someone proposed a n^2 sort, I wouldn't tell them there was a bug in their code, just that it's not *good* or *efficient* code. – sapi Sep 23 '14 at 05:00
  • 4
    I don't know why this post is getting so many votes, while it doesn't explain the actual reason for the catastrophic backtracking (note: nested quantifier doesn't always cause catastrophic backtracking), and the solution is ugly. – nhahtdh Sep 23 '14 at 05:47
  • @R..G can you clarify how it's a bug in Python? Does PCRE (for example) not have the same issue given the same input and pattern? – Nick T Jan 09 '20 at 17:45
  • @NickT: I'm not sure whether PCRE has the same bug or not (many bad regex implementations do), but that doesn't change whether it's a bug. Actually-regular expressions (which these are) can be matched in linear time with possibly-heavy space usage, or bilinear time (linear in both regex pattern length and search text length) without heavy space usage. But bad implementations use backtracking which is exponential-time. – R.. GitHub STOP HELPING ICE Jan 09 '20 at 19:24
32

First off, I must say that this is not a BUG. What your regex is doing is that it's trying all the possibilities due to the nested repeating patters. Sometimes this process can gobble up a lot of time and resources and when it gets really bad, it’s called catastrophic backtracking.

This is the code of findall function in python source code:

 def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.
    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

As you see it just use the compile() function, so based on _compile() function that actually use Traditional NFA that python use for its regex matching, and base on this short explain about backtracking in regular expression in Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl!

The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options, it selects one and remembers the other to return to later if need be. Situations where it has to decide among courses of action include anything with a quantifier (decide whether to try another match), and alternation (decide which alter native to try, and which to leave for later). Whichever course of action is attempted, if it’s successful and the rest of the regex is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the first option, and can continue with the match by trying the other option. This way, it eventually tries all possible permutations of the regex (or at least as many as needed until a match is found).

Let's go inside your pattern: So you have r'(\d+(,)?)+/$' with this string '12345121,223456,123123,3234,4523,523523' we have this steps:

  • At first, the first part of your string (12345121) is matched with \d+, then , is matched with (,)? .
  • Then based on first step, the whole string is match due to + after the grouping ((\d+(,)?)+)
  • Then at the end, there is nothing for /$ to be matched. Therefore, (\d+(,)?)+ needs to "backtrack" to one character before the last for check for /$. Again, it don't find any proper match, so after that it's (,)'s turn to backtrack, then \d+ will backtrack, and this backtracking will be continue to end till it return None. So based on the length of your string it takes time, which in this case is very high, and it create a nested quantifiers entirely!

As an approximately benchmarking, in this case, you have 39 character so you need 3^39 backtracking attempts (we have 3 methods for backtrack).

Now for better understanding, I measure the runtime of the program while changing the length of the string:

'12345121,223456,123123,3234,4523,' 3^33 = 5.559060567×10¹⁵
~/Desktop $ time python ex.py

real    0m3.814s
user    0m3.818s
sys     0m0.000s

'12345121,223456,123123,3234,4523,5' 3^24 = 1.66771817×10¹⁶ #X2 before
~/Desktop $ time python ex.py

real    0m5.846s
user    0m5.837s
sys     0m0.015s

'12345121,223456,123123,3234,4523,523' 3^36= 1.500946353×10¹⁷ #~10X before 
~/Desktop $ time python ex.py

real    0m15.796s
user    0m15.803s
sys     0m0.008s

So to avoid this problem you can use one of the below ways:

  • Atomic grouping (Currently doesn't support in Python, A RFE was created to add it to Python 3)
  • Reduction the possibility of backtracking by breaking the nested groups to separate regexes.
Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • `[\d,]+\d+/$` - this is going to match `,,,,,5`. And I think the code of `_compile` is irrelevant to the answer - unless you want to expand on it to demonstrate why it causes the problem. – nhahtdh Sep 24 '14 at 19:38
  • actually the compile function is about defining the `Traditional NFA` that python uses for its regex matching ! so i will add some explain about that too on answer ! – Mazdak Sep 24 '14 at 20:25
  • I edited your post to clean up the formatting. I also dropped non-capturing group from your answer, since it is **not** a solution. – nhahtdh Sep 25 '14 at 15:47
  • @nhahtdh thank for attention and editing so why non-capturing is not a solution i suggest that for refusing of `backtracking` !! – Mazdak Sep 25 '14 at 16:34
  • @Kasra: No, non-capturing group just take away the capturing text property of the `()` syntax. It doesn't affect backtracking. If anything, it reduces the memory usage a little bit, since it doesn't have to keep track of the starting and ending point of the matches. `(?>...)` syntax (atomic group) will prune the search tree and reduces backtracking, but you need to design your pattern right, or it will fail to match what you want. – nhahtdh Sep 25 '14 at 19:37
30

To avoid the catastrophic backtracking I suggest

r'\d+(,\d+)*/$'
Charles
  • 11,269
  • 13
  • 67
  • 105
  • @Sam: I actually got downvoted for this, I can't imagine why. I mean, I can't match your answer, but I think what I wrote was useful (if terse). :) – Charles Sep 23 '14 at 05:06
  • That's odd. I agree, at that point it isn't worth discussing the backtracking or benchmarking that @Kasra and I did..but your answer is simple, effective, and uses no logic that OP is unfamiliar with. – Sam Sep 23 '14 at 05:09
  • 1
    This is provides the best solution in my opinion, and this is the standard formula for matching "token separator token separator token". The downside is that there is no explanation what is going on, though. – nhahtdh Sep 23 '14 at 05:52
  • 2
    Even simpler and faster: `\d(?:\d|,\d)*+/` – Falco Sep 23 '14 at 08:07
  • But I agree your solution is more idiomatic - allthough you should ask a possesive + on all quantifiers, to prevent any form of backtracking – Falco Sep 23 '14 at 08:11
  • 2
    I didn't downvote, but I would guess this was downvoted because it does not answer either of the OP's questions ("Why does this take so long to match?" and "Is it a bug?"). In practical terms, this omission is significant because it will not help the OP (or any later readers of the question) avoid the same mistake in the future with another regex. – apsillers Sep 23 '14 at 12:36
  • For the record, I like your answer, too : ) – abhillman Sep 23 '14 at 16:23
  • @Falco: Not sure how it is simpler - it is quite confusing to read. – nhahtdh Sep 23 '14 at 17:31
  • 3
    @Falco, where are you testing that regex? If it's faster than this one, it's only because of the possessive star (`*+`), which Python doesn't support. But it can't be much faster; Charles' regex is as simple and straightforward as can be. – Alan Moore Sep 23 '14 at 21:19
  • @nhahtda My regex is simply one decimal followed by any number of decimals or decimals preceeded by a comma. I benchmarked with java and php, the possessive + can be replaced by a non backtracking subgroup. No backtracking makes it significantly faster! – Falco Sep 23 '14 at 22:56
  • this is the result of your pattern when we add a `/` at trailing of the string . is the result true ? `>>> re.findall(r'\d+(,\d+)*/$', '12345121,223456,123123,3234,4523,523523/') [',523523']` – Mazdak Sep 24 '14 at 06:18
  • @Falco: It is faster - I didn't deny that. However, a maintainer would be pulling his hair at your regex, since it is quite confusing what it is trying to match. – nhahtdh Sep 24 '14 at 11:30
  • 1
    @nhahtdh I don't know, this is a pretty simple regex: `\d ( \d | ,\d )*+ /` You can also comment it in code: _A digit followed by any number of (digits OR digits preceeded by a comma) and ending with a slash_ – Falco Sep 24 '14 at 15:09
  • 1
    You could also write it `\d(,?\d)*+/` which scans a bit better for me. – Charles Sep 24 '14 at 18:10