Python infinite loop in regex to match url

Question

I am trying to extract URLs from a text file, but my code gets stuck in what looks like an infinite loop:

import re

URL_PATTERN = re.compile(ur'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')

with open("some_text_file") as RAW:
    for line in RAW:
        RESULT = URL_PATTERN.findall(line)
        links = []
        for HTTP_TUPLES in RESULT:
            links.append(HTTP_TUPLES[0])

How can I avoid that?

PS: Yes, I know about urllib and other modules.

@vks, the OP seems to say that evaluating the regular expression enters a loop — akonsu, Jan 28 '15 at 05:17
What are you doing with `RESULT`? It looks like it's getting overwritten with each line, so you lose the previous results. — PM 2Ring, Jan 28 '15 at 05:19
I am trying to create list of first elements from that RESULT tuple: links = [] for HTTP_TUPLES in RESULT: links.append(HTTP_TUPLES[0]) — Vladimir, Jan 28 '15 at 05:21
BTW, it's usual to write simple variable names in Python in all lower case. UPPER CASE is used for module-level constants. See https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions — PM 2Ring, Jan 28 '15 at 05:22
@Vladimir: In that case, you should put that code in your question. — PM 2Ring, Jan 28 '15 at 05:24
You might want to check your regular expression groupings. I do not think they do what you intend them to do. Additional help for regular expression testing can be found here: https://www.regex101.com/#python. They also offer a debugging area where they step through the matching for you! — jakebird451, Jan 28 '15 at 05:25
Here is a sample: https://www.dropbox.com/s/e7vev2nnqruowge/sample.gz?dl=0 Sorry, it's quite big. — Vladimir, Jan 28 '15 at 05:29
I tested my regex here: https://regex101.com/#python. It works. — Vladimir, Jan 28 '15 at 05:31
@Vladimir `http://aa` is not a valid url. You might want to check your regular expressions. — jakebird451, Jan 28 '15 at 05:40

vks · Answer 1 · 2023-02-08T04:49:23.050

1

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>'",]+|\(([^\s()<>'",]+|(\([^\s()<>'",]+\)))*\))+(?:\(([^\s()<>'",]+|(\([^\s()<>'",]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Try this. This will do it for you. See the demo.

https://regex101.com/r/ib6eed/1

edited Feb 08 '23 at 04:49

answered Jan 28 '15 at 06:35

vks


Can you explain why you make those changes? – nhahtdh Jan 28 '15 at 07:46

score 1 · Accepted Answer · answered Jan 28 '15 at 08:15

I don't address the correctness of the regex in this answer. You might want to take a look at this article on URL validation and customize it for your matching task.

Problem

Your regex contains a classic example of catastrophic backtracking, in the form (A*)*.

For example, in this portion:

(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+

If you throw away the second branch, you will immediately see the problem:

(?:[^\s()<>]+)+

The second branch also contains an instance of the problematic pattern:

([^\s()<>]+|(\([^\s()<>]+\)))*

degenerates to:

([^\s()<>]+)*

To demonstrate the problem you can test your regex on this non-matching string:

sdfsdf http://www/sdfsdfsdf(sdsdfsdfsdfsdfsdfsdf sfsdf(Sdfsdf)(sdfsdF)(sdfdsF)(<))sdsdfsf

Demo on regex101
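The growth can also be observed directly in Python. The following is a minimal sketch, not part of the original regex: it takes the degenerate form of the first branch, appends a literal `\)` that the input never contains so that every attempt has to fail, and times it against a flat variant. The names `NESTED`, `FLAT` and `time_failed_match` are made up for illustration, and the input lengths are kept small on purpose.

```python
import re
import time

# Degenerate form of the first branch, followed by a literal ")" that the
# test input never contains, so every match attempt has to fail.
NESTED = re.compile(r"(?:[^\s()<>]+)+\)")
# Same idea with the inner "+" removed: only one way to split the input.
FLAT = re.compile(r"(?:[^\s()<>])+\)")


def time_failed_match(pattern, length):
    """Return (match, seconds) for matching `pattern` against `length` letters."""
    text = "a" * length
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for length in (10, 14, 18, 20):
        _, slow = time_failed_match(NESTED, length)
        _, fast = time_failed_match(FLAT, length)
        print(f"{length:2d} chars: nested {slow:.4f}s, flat {fast:.6f}s")
```

On a backtracking engine such as Python's `re`, the nested timings should grow roughly exponentially with the input length, while the flat ones should stay small. Exact numbers depend on the machine and Python version.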

Solution

Using the same snippet from your regex to demonstrate:

(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
            ^             ^

In languages that support possessive quantifiers, one option is to make those quantifiers possessive, since the two branches of your regex are mutually exclusive.

However, Python's re module doesn't support possessive quantifiers (before Python 3.11), so you can instead remove the quantifiers at the marked positions. This doesn't affect the result, because the repetition is already handled by the quantifier on the immediately enclosing group.
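As an aside on the possessive option: the `re` module accepts possessive quantifiers such as `++` starting with Python 3.11. The sketch below is an illustration under that assumption, not something from the original answer. It applies `++` to the degenerate first branch only, with a literal `\)` appended to force a failure, and guards the compilation with a version check. The helper name `possessive_branch` is hypothetical.

```python
import re
import sys

# Possessive quantifiers are only accepted by `re` on Python 3.11 or later,
# so the pattern is compiled lazily behind a version check.
SUPPORTS_POSSESSIVE = sys.version_info >= (3, 11)


def possessive_branch():
    """Compile the branch with a possessive inner quantifier, or return None."""
    if not SUPPORTS_POSSESSIVE:
        return None
    return re.compile(r"(?:[^\s()<>]++)+\)")


if __name__ == "__main__":
    pattern = possessive_branch()
    if pattern is None:
        print("possessive quantifiers need Python 3.11+")
    else:
        # Fails quickly: the inner "++" never gives characters back.
        print(pattern.match("a" * 40))
        print(pattern.match("abc)"))
```

Removing the inner quantifier, as described here, works on every Python version and is the more portable choice.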

The final result (which takes care of the same problem in the last group):

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Demo on regex101
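For reference, here is a minimal Python 3 sketch that plugs the final regex into a small helper. The helper name `extract_links` and the sample inputs are assumptions for illustration. The Python 2 `ur'''` prefix from the question is replaced with `r'''`, since Python 3 strings are already Unicode.

```python
import re

# Final regex from this answer, written as a Python 3 raw string.
URL_PATTERN = re.compile(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')


def extract_links(text):
    """Return the full URL (group 1) of every match in `text`."""
    return [match.group(1) for match in URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    # The stress string from the Problem section; this should return promptly.
    stress = ("sdfsdf http://www/sdfsdfsdf(sdsdfsdfsdfsdfsdfsdf "
              "sfsdf(Sdfsdf)(sdfsdF)(sdfdsF)(<))sdsdfsf")
    print(extract_links(stress))
    print(extract_links("see http://example.com/path now"))
```

Using `finditer` with `group(1)` avoids indexing into the tuples that `findall` returns when a pattern has several capturing groups.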

score 0 · Answer 3 · answered Jan 28 '15 at 05:32

Try:

import re

URL_PATTERN = re.compile(ur'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')

RESULT = []
with open("some_text_file") as RAW:
  map(lambda x:RESULT.extend(URL_PATTERN.findall(x)), RAW.xreadlines())

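In Python 3, remove xreadlines(), as the file object itself is an iterator. Also note that map() is lazy in Python 3, so the lambda would never run unless the result is consumed; a plain for loop is the safer choice there.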
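A minimal Python 3 sketch of the same line-by-line collection, written as a plain loop. The helper name `collect_links` and the simplified stand-in pattern `SIMPLE_URL` are assumptions for illustration and are not taken from the answer; the real URL pattern can be passed in instead.

```python
import io
import re

# Simplified stand-in pattern (an assumption for illustration only);
# pass the real URL pattern as `pattern` when using this.
SIMPLE_URL = re.compile(r"https?://[^\s()<>]+")


def collect_links(lines, pattern=SIMPLE_URL):
    """Gather every match from an iterable of lines into one list."""
    links = []
    for line in lines:
        # findall returns strings for a pattern without groups and
        # tuples for a pattern with several capturing groups.
        links.extend(pattern.findall(line))
    return links


if __name__ == "__main__":
    sample = io.StringIO(
        "first http://a.example/x\nnone here\nhttp://b.example/y end\n"
    )
    print(collect_links(sample))
```

An open file object can be passed as `lines` directly, because iterating over it yields one line at a time.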

belteshazzar


same result with sample of file in the above comment. – Vladimir Jan 28 '15 at 05:41
It's the regex. Tried it with different less accurate ones and they all returned results in 5 or so seconds. Just leave it running, it should finish in due time. – belteshazzar Jan 28 '15 at 07:17
