-1

I have a requirement to extract URLs from various text inputs. The below solution is working 99% of the time but chokes on some inputs like the one below. I've never had a regular expression max out a CPU on such a small input. Any pointers on what is wrong with the regex are appreciated.

import re


url_regex = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""


text = """background-image:url(http://109.87.135.101/infos-auto/files/2013/07/Ex-banni"""

urls = re.findall(url_regex ,text)

print urls

edit: I got this regex from here: https://github.com/rcompton/ryancompton.net/blob/master/assets/praw_drugs/urlmarker.py

Chris Hall
  • 871
  • 6
  • 13
  • 21
  • 2
    It's called [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). You need to re-study all requirements again. – revo Jul 03 '18 at 19:12
  • Check [Regex to find urls in string in Python](https://stackoverflow.com/a/6883094/3832970). – Wiktor Stribiżew Jul 03 '18 at 19:14
  • This is your problem `\(([^\s()<>]+|(\([^\s()<>]+\)))*\)`, want to know how to fix it ? –  Jul 03 '18 at 19:17

1 Answers1

0

Update

After looking at your regex a bit more, it seems you have multiple problems.
Yes, catastrophic backtracking is part of it.
But, this doesn't have to be a problem. What exasperates it is the overlapping negative classes in it.

You have a larger negated class [^\s`!()\[\]{};:'".,<>?«»“”‘’]
that is causing the problem.
Because this negated class [^\s()<>] is a subset of that one, it allows
more characters.
Since the larger class is at the end of the regex, this is what is
causing the backtrack problem.

The last two clusters are functionally identical except for the larger class.

You can thus combine them to both eliminate backtracking and make your
regex more efficient.

Final Update

The larger negated class was at the end in the original.
What we will do is still combine the last clusters using the smaller
negated class, allowing more to pass through initially.
We take away the nested quantifiers to make it catastrophic backtracking bullet proof
from this combined cluster.

Finally we add a lookbehind assertion at the end that takes the smaller class
out of the larger class producing a new negated class [^`!\[\]{};:'".,?«»“”‘’] to be used in the assertion.

We can do this because the cluster before it doesn't allow those characters.

The result is it still backtracks, but by a large degree, not so much.

You must realize this type of regex is inherently prone to backtracking,
because of the differential between the large and small negated classes.

As far as Is this the best it can be ?
The answer is Yes there is no other modification that can make it better.

A note, you be sure to tell the person/place or thing you got this from
it is a pretty crappy regex. Use this instead !!

r"(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?<=[^`!\[\]{};:'\".,?«»“”‘’]))"

https://regex101.com/r/5nyZXD/1

Readable version

 (?i)
 \b 
 (                             # (1 start)
      (?:
           [a-z] [\w-]+ :
           (?: /{1,3} | [a-z0-9%] )
        |  www \d{0,3} [.] 
        |  [a-z0-9.\-]+ [.] [a-z]{2,4} /
      )
      (?:
           [^\s()<>] 
        |  

           \(
           (                             # (2 start)
                [^\s()<>] 
             |  ( \( [^\s()<>]+ \) )          # (3)
           )*                            # (2 end)
           \)
      )+
      (?<= [^`!\[\]{};:'".,?«»“”‘’] )

 )                             # (1 end)
  • There's still at least one more catastrophic backtracking problem in there, a bit before the second capturing group. Also, even with the "here->" marker, it's hard to see what you changed. – user2357112 Jul 03 '18 at 19:44
  • @user2357112 - I don't think there is, but starting from the beginning, just quote me up to where it is. I only removed the _first_ mentioned, not the second. And, I did not remove the second potential for a reason. –  Jul 03 '18 at 19:48
  • The first instance of `[^\s()<>]+` in your new regex is still inside another `+`, in a way that could lead to exponential catastrophic backtracking. It's also redundant with the `[^\s()<>]` that's part of the other side of the alternation, which could lead to more backtracking problems, and I don't know if that's all. – user2357112 Jul 03 '18 at 19:53
  • @user2357112 - So you're talking about my _Note_ which mentions that. No, not really a problem on this outer one. Not all nested quantifiers are bad, depends on the usage. The inner one keeps backtracking back until the `(` which is the problem. On the second restart (from `(`) it gets bumped to the outter one because it is in _backtrack_ mode. This then causes the exponential/compount backtracking. You release the bactracker from the compound mode by removing the _inner_ quantifier. Hence my reasoning. Although the outter one could be removed if worried about it (as said in my _Note_). –  Jul 03 '18 at 20:07
  • 1
    I think you may be looking at the wrong quantifiers. Here's an [empirical example](https://ideone.com/ktGDSc) of the regex still timing out on a 5 second time limit due to catastrophic backtracking. – user2357112 Jul 03 '18 at 20:16
  • @user2357112 - Yeah, it's gonna backtrack a little on the outer nest. Certainly not catastrophic to timeout. https://regex101.com/r/8Wvieo/1 I wouldn't trust ideone. The thing is, in this case, the benefit outweighs it with speed with a nested greedy quantifier in that position. Here are the step difference removing the `+` at both places. Again, this is the worst case scenario. https://regex101.com/r/M8Tddk/1 and https://regex101.com/r/inMed9/1. If it were me, since there's an alternation involved, I'd remove it in both spots. –  Jul 03 '18 at 20:34
  • 1
    [Here's an example of regex101 reporting catastrophic backtracking too.](https://regex101.com/r/RyCPoE/1) – user2357112 Jul 03 '18 at 20:37
  • @user2357112 - There are other issues not related to catastrophic backtracking. https://regex101.com/r/hF3ol6/1 It shouldn't be doing this. Got any ideas ? –  Jul 03 '18 at 20:48
  • I don't know whether it's supposed to do what it does in that example. The questioner dumped a huge, complex, buggy regex on us without enough explanation, which is part of why I haven't tried to post a corrected version. I can be pretty sure the questioner doesn't want catastrophic backtracking, but I don't know the exact behavior they *do* want. – user2357112 Jul 03 '18 at 21:07
  • @user2357112 - No problem, I posted a fix for both problems. –  Jul 04 '18 at 16:00
  • That new regex only extracts the domain portion of the URL and truncates the URI..It seems the only regexes that can successfully extract an entire URL are the ones with the catastrophic backtracking – Chris Hall Jul 05 '18 at 13:52