What makes this regular expression very slow?

Question

I'm developing an application where users can use regular expressions to search and extract information from text. One user inserted the following regexp: ([^\s]*( ?)){3}\n (they added a blank line at the end of the expression). This can be translated to Python:

import re

regexp = re.compile(r'([^\s]*( ?)){3}\n')

The regexp is used to extract an URL from a set of URLs separated by spaces. When you run the following command, the time spent evaluating the search is about half a second on my machine.

regexp.search(u'https://www.example.com/data/products/64973s3d.jpg https://www.example.com/data/products/j4e0ls.jpg https://www.example.com/data/products/sfd69es6d9.jpg https://www.example.com/data/products/6sd9fw.jpg https://www.example.com/data/products/64973s3d.jpg https://www.example.com/data/products/j4e0ls.jpg https://www.example.com/data/products/64973s3d.jpg https://www.example.com/data/products/j4e0ls.jpg https://www.example.com/data/products/64973s3d.jpg https://www.example.com/data/products/j4e0ls.jpg https://www.example.com/data/products/64973s3d.jpg https://www.example.com/data/products/j4e0ls.jpg')

When you remove the new line at end of the regexp (([^\s]*( ?)){3}), the search is much faster.

Why does this happen? And is it possible to prevent this behavior so that users cannot slow down the whole application by using similar (slow) regular expressions?

There was a (conceptually) similar question yesterday: https://stackoverflow.com/questions/45463148/fixing-catastrophic-backtracking-in-regular-expression See especially Wiktors explanation of the general pattern, that is also in use here (well your pattern is `(a*b?)+c`, but that doesn't make it any better). — Sebastian Proske, Aug 03 '17 at 13:35
Concerning preventing - I don't see any possibility other than timouting your function. — Sebastian Proske, Aug 03 '17 at 13:38
Yes, the reason is the same: once a `(a+b?)+` is inside a pattern (not at the end of the pattern), catastrophical backtracking is imminent. And note there is no possessive quantifier, nor atomic group support in Python `re`. — Wiktor Stribiżew, Aug 03 '17 at 13:38

What makes this regular expression very slow?

0 Answers0