I am trying to extract URLs from a text file, and my script gets stuck in what looks like an infinite loop:

import re

URL_PATTERN = re.compile(ur'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')

with open("some_text_file") as RAW:
    for line in RAW:
        RESULT = URL_PATTERN.findall(line)
        links = []
        for HTTP_TUPLES in RESULT:
            links.append(HTTP_TUPLES[0])

How can I avoid that?

PS: Yes, I know about urllib and other modules.

Vladimir
  • 2
    This doesn't look like an infinite loop... – vks Jan 28 '15 at 05:15
  • debug the expression – akonsu Jan 28 '15 at 05:16
  • @vks, the OP seems to say that evaluating the regular expression enters a loop – akonsu Jan 28 '15 at 05:17
  • What are you doing with `RESULT`? It looks like it's getting overwritten with each line, so you lose the previous results. – PM 2Ring Jan 28 '15 at 05:19
  • I am trying to create a list of the first elements of the tuples in RESULT: `links = []`, then `for HTTP_TUPLES in RESULT: links.append(HTTP_TUPLES[0])` – Vladimir Jan 28 '15 at 05:21
  • BTW, it's usual to write simple variable names in Python in all lower case. UPPER CASE is used for module-level constants. See https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions – PM 2Ring Jan 28 '15 at 05:22
  • @Vladimir: In that case, you should put that code in your question. – PM 2Ring Jan 28 '15 at 05:24
  • 2
    You might want to check your regular expression groupings. I do not think they do what you intend them to do. Additional help for regular expression testing can be found here: https://www.regex101.com/#python. They also offer a debugging area where they step through the matching for you! – jakebird451 Jan 28 '15 at 05:25
  • Can you post some sample lines? – vks Jan 28 '15 at 05:26
  • Here is a sample: https://www.dropbox.com/s/e7vev2nnqruowge/sample.gz?dl=0 Sorry, it's quite big. – Vladimir Jan 28 '15 at 05:29
  • I tested my regex at https://regex101.com/#python. It works there. – Vladimir Jan 28 '15 at 05:31
  • 2
    @Vladimir `http://aa` is not a valid url. You might want to check your regular expressions. – jakebird451 Jan 28 '15 at 05:40

3 Answers

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>'",]+|\(([^\s()<>'",]+|(\([^\s()<>'",]+\)))*\))+(?:\(([^\s()<>'",]+|(\([^\s()<>'",]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Try this. It should do it for you. See the demo:

https://regex101.com/r/ib6eed/1

vks

I don't address the correctness of the regex in this answer. You might want to take a look at this article on URL validation and customize it for your matching task.

Problem

Your regex contains a classic example of catastrophic backtracking, in the form of (A*)*.

For example, in this portion:

(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+

If you throw away the second branch, you will immediately see the problem:

(?:[^\s()<>]+)+

The second branch also contains an instance of the problematic pattern:

([^\s()<>]+|(\([^\s()<>]+\)))*

degenerates to:

([^\s()<>]+)*

To demonstrate the problem you can test your regex on this non-matching string:

sdfsdf http://www/sdfsdfsdf(sdsdfsdfsdfsdfsdfsdf sfsdf(Sdfsdf)(sdfsdF)(sdfdsF)(<))sdsdfsf

Demo on regex101
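
The blow-up can also be reproduced locally. The following is a minimal sketch, not the asker's full regex: it isolates the degenerate sub-pattern shown above and appends a `$` (an addition made for this demo) so that the match is forced to fail. The names `NESTED`, `FLAT`, and `time_match` are made up for illustration.

```python
import re
import time

# Degenerate sub-pattern from the answer. The trailing "$" is an addition
# for this demo: it makes the overall match fail, which forces backtracking.
NESTED = re.compile(r'(?:[^\s()<>]+)+$')
# Matches the same strings, but without the inner quantifier.
FLAT = re.compile(r'(?:[^\s()<>])+$')


def time_match(pattern, text):
    """Return (match_result, elapsed_seconds) for pattern.match(text)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for n in (10, 14, 18):
        # "<" can never be consumed by the character class, so the match fails
        text = "a" * n + "<"
        _, nested_seconds = time_match(NESTED, text)
        _, flat_seconds = time_match(FLAT, text)
        print(f"n={n:2d}  nested: {nested_seconds:.4f}s  flat: {flat_seconds:.6f}s")
```

On a backtracking engine such as Python's `re`, the time for the nested version should grow very quickly with `n`, while the flat version stays close to constant at these sizes. Exact timings depend on the machine and the Python version.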

Solution

Using the snippet above from your regex to demo:

(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
            ^             ^

In languages that support possessive quantifiers, one option is to make those quantifiers possessive, since the two branches of the alternation are mutually exclusive.

However, Python's re module did not support possessive quantifiers until version 3.11, so the portable fix is to remove the quantifiers at the marked positions. This does not affect the result, because the repetition is already handled by the quantifier of the immediately enclosing group.

The final result (which takes care of the same problem in the last group):

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Demo on regex101
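
As a rough sketch of how the fixed pattern could be plugged back into the extraction loop from the question, assuming Python 3 (hence the `r'''` prefix instead of `ur'''`): the helper names `extract_links` and `extract_links_from_file` and the UTF-8 encoding are assumptions made for this example, not part of the original code.

```python
import re

# Fixed pattern from the answer above, written as a Python 3 raw string.
URL_PATTERN = re.compile(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')


def extract_links(lines):
    """Collect the full URL (capture group 1) of every match in every line."""
    links = []  # created once, so results from earlier lines are kept
    for line in lines:
        for match in URL_PATTERN.finditer(line):
            links.append(match.group(1))
    return links


def extract_links_from_file(path):
    """Hypothetical convenience wrapper; the encoding is an assumption."""
    with open(path, encoding="utf-8") as raw:
        return extract_links(raw)


if __name__ == "__main__":
    print(extract_links(["See http://example.com/path and www.python.org."]))
    # An input that fails to match: the fixed pattern rejects it quickly.
    print(extract_links(["http://" + "!" * 200]))
```

Using `finditer` with `match.group(1)` avoids indexing into the tuples that `findall` returns when a pattern has several capture groups.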

nhahtdh

Try:

import re

URL_PATTERN = re.compile(ur'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')

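RESULT = []
with open("some_text_file") as RAW:
    for line in RAW:
        # findall returns one tuple of groups per match; collect them all
        RESULT.extend(URL_PATTERN.findall(line))

A plain loop is used rather than map() with a side-effecting lambda, because map() is lazy in Python 3 and would never run the lambda. Iterating over the file object directly works in both Python 2 and Python 3 (xreadlines() exists only in Python 2). In Python 3, also change the ur''' prefix to r''', since ur is not valid syntax there.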

belteshazzar