
I want to write a regex to capture URLs in a text. The problem is that any decent regex I use for capturing URLs encounters catastrophic backtracking on some URLs.

I have tried the "diegoperini" regex and have also read several other questions and answers here; however, none of them solved my problem.

I have also tried three other regexes:

SIMPLE_URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
ANY_URL_REGEX = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))"""

The simple URL regex does not get trapped in the cases I tried, but it also does not perform very well and misses many URLs; the other two perform better but get trapped in some cases.

Now, one part of my problem was the encoding of non-ASCII URLs, which was apparently solved by decoding the text like this:

from urllib.parse import unquote  # Python 3 (on Python 2: from urllib import unquote)

try:
    meta = unquote(meta.encode('utf-8')).decode('utf-8')
except TypeError:
    # unquote() may reject bytes on Python 3, so fall back to the str form
    meta = unquote(meta)

But some time later another problematic URL came up, something like this one:

https://www.example.net/ar/book//%DA%A9-%D8%B3-(%D9%81-%DB%8C-%DB%8C-%DB%8C-%DB%8C-%DB%8C-%DB%8C-%D9%85)

Such URLs are rare, but when they occur they trigger very inefficient backtracking, which causes the program to stop responding indefinitely. (As I have read, the problem is that the regex module does not release the GIL.)
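
For illustration, here is a minimal pattern (not one of the patterns above) that shows the same behavior: the nested quantifiers give the engine exponentially many ways to split a run of characters, and the non-matching tail forces it to try them all.

import re

# (a+)+ can partition a run of 'a's in exponentially many ways, and the
# trailing 'b' guarantees the match fails, so every partition is tried.
evil = re.compile(r'(a+)+$')
evil.match('a' * 30 + 'b')  # effectively hangs, like the URL case above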

Considering all of this, I have two questions:

  • First, is there (or can there be) a regex pattern for matching URLs that performs reasonably well and completely avoids catastrophic backtracking?

  • Second, if there is no such regex, is there another way to catch the cases where the regex gets trapped, and either throw an exception or bypass them in some other way?

    One approach is to use a naive pattern to catch URLs in a text and then validate/filter the URLs using a built-in function. Trying to match and validate URLs at the same time with a crazy pattern is IMO a bad idea. – Casimir et Hippolyte Dec 18 '17 at 21:15
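
A minimal sketch of the approach suggested in the comment above, assuming the URLs of interest always carry an explicit http(s) scheme (that restriction, the pattern, and the function name are illustrative assumptions, not part of the question):

import re
from urllib.parse import urlparse

# Naive, backtracking-safe extraction: a candidate URL is simply a run of
# non-whitespace characters starting with http:// or https://.
CANDIDATE = re.compile(r'https?://\S+', re.IGNORECASE)

def extract_urls(text):
    urls = []
    for token in CANDIDATE.findall(text):
        # Validate with the standard library instead of a complex regex.
        parts = urlparse(token)
        if parts.scheme in ('http', 'https') and parts.netloc:
            urls.append(token)
    return urls

The extraction pattern has no nested quantifiers, so it runs in linear time; all the fussy validation is pushed onto urlparse.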

2 Answers


This one uses the \x{XXXX} notation for Unicode chars; substitute whatever notation your regex engine uses (the string below is escaped as a Java string).

Also, the biggest issue is the boundaries, i.e. the things around the URL.

The one below uses a whitespace boundary, though you can remove it and try your luck.

"(?i)(?<!\\S)(?!mailto:)(?:[a-z]*:\\/\\/)?(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\x{a1}-\\x{ffff}0-9]+-?)*[a-z\\x{a1}-\\x{ffff}0-9]+)(?:\\.(?:[a-z\\x{a1}-\\x{ffff}0-9]+-?)*[a-z\\x{a1}-\\x{ffff}0-9]+)*(?:\\.(?:[a-z\\x{a1}-\\x{ffff}]{2,})))|localhost)(?::\\d{2,5})?(?:\\/[^\\s]*)?(?!\\S)"

Formatted

 (?i)
 (?<! \S )
 (?! mailto: )
 (?:
      [a-z]* :
      \/\/
 )?
 (?:
      \S+ 
      (?: : \S* )?
      @
 )?
 (?:
      (?:
           (?:
                [1-9] \d? 
             |  1 \d\d 
             |  2 [01] \d 
             |  22 [0-3] 
           )
           (?:
                \.
                (?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
           ){2}
           (?:
                \.
                (?:
                     [1-9] \d? 
                  |  1 \d\d 
                  |  2 [0-4] \d 
                  |  25 [0-4] 
                )
           )
        |  (?:
                (?: [a-z\x{a1}-\x{ffff}0-9]+ -? )*
                [a-z\x{a1}-\x{ffff}0-9]+ 
           )
           (?:
                \.
                (?: [a-z\x{a1}-\x{ffff}0-9]+ -? )*
                [a-z\x{a1}-\x{ffff}0-9]+ 
           )*
           (?:
                \.
                (?: [a-z\x{a1}-\x{ffff}]{2,} )
           )
      )
   |  localhost
 )
 (?: : \d{2,5} )?
 (?: \/ [^\s]* )?
 (?! \S )
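
If you are using Python's re module (as the question does), note that it does not support the \x{XXXX} notation; the equivalent range there is written \u00a1-\uffff. A small, assumed translation of the host-label character class used above:

import re

# Python spelling of [a-z\x{a1}-\x{ffff}0-9]+ from the pattern above
label = re.compile(r'[a-z\u00a1-\uffff0-9]+')

# The class accepts ASCII letters/digits and non-ASCII letters alike
assert label.fullmatch('example')
assert label.fullmatch('кириллица')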

After an extensive search, I found a semi-solution to my problem.

This solution does not change the regex in question, but uses a timeout to raise an exception when the regex gets stuck in backtracking.

I added the timeout-decorator package and wrote something like this:

from timeout_decorator import timeout, TimeoutError

RE_TIMEOUT = 5  # seconds to allow the regex to run; tune as needed

@timeout(seconds=RE_TIMEOUT)
def match_regex_timeout(compiled_regex, replacer, data):
    # Raises TimeoutError if the substitution does not finish in time.
    return compiled_regex.sub(replacer, data)

Using the function would look something like this:

import re
import logging

logger = logging.getLogger(__name__)

# url_regex is whichever URL pattern you settled on (see the question)
url_match = re.compile(url_regex, flags=re.MULTILINE)
replacer = ' URL '

try:
    text = match_regex_timeout(url_match, replacer, text)
except TimeoutError:
    logger.error('REGEX TIMEOUT ERROR: can not parse URL')
    text = remove_big_tokens(text)

This basically tries to parse the text and, if it fails to do so within the expected time, falls back to removing big tokens from the text, which are the most likely to be the problematic URLs.
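
remove_big_tokens is not shown above; a minimal sketch of what it might look like, assuming "big tokens" means whitespace-separated tokens longer than some threshold (the name comes from the answer, but the threshold and implementation are assumptions):

def remove_big_tokens(text, max_len=100):
    # Drop any whitespace-separated token long enough to plausibly be one
    # of the pathological URLs; keep the rest of the text intact.
    return ' '.join(tok for tok in text.split() if len(tok) <= max_len)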
