Regular expression to extract URLs with difficult formatting

Question

My question is similar to this one, but more complicated.

I am trying to figure out a regex to extract URLs from a text document. The tricky thing is that some of the URLs are embedded in sentences with hard-to-parse formatting. Here's an example of text I would like to extract URLs from:

<p>There are several links of the general format http://www.foo.com/index.html.</p>
<p>There are many websites (e.g. http://www.foo.com/abc/def?a=2&b=3) that end oddly: http://www.foo.com/results</p>

In these examples, the first URL is immediately followed by a sentence-ending period that needs to be excluded. The second URL is followed by a right parenthesis, and the third ends when it hits an HTML tag.

For my purposes, a period (or a right parenthesis) is a valid URL character unless it is the very last character. In short, the problem is how to handle characters that are valid inside a string but not as its very last character.

My current regex (in Python), which cannot handle this case, is:

m = re.findall(r"((http:|https:)//[^ <]+)", line)

Any thoughts on elegant ways to deal with this?

Possibly you want one of these http://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python http://stackoverflow.com/questions/499345/regular-expression-to-extract-url-from-an-html-link http://stackoverflow.com/questions/520031/whats-the-cleanest-way-to-extract-urls-from-a-string-using-python — daydreamer, Dec 08 '11 at 20:03

score 3 · Accepted Answer · answered Dec 08 '11 at 19:59

You can forbid a period as the last character like this:

m = re.findall(r"((http:|https:)//[^ <]*[^ <.])", line)
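As a rough sketch of how this behaves on the sample text from the question: the greedy `[^ <]*` first swallows the trailing period, then backtracks one character because the final class refuses to match it. The `extract_urls` helper and the `EXTENDED_PATTERN` variant, which also rejects a trailing right parenthesis, are assumptions added for illustration and are not part of the original answer.

```python
import re

# Sample text taken from the question.
text = (
    "<p>There are several links of the general format "
    "http://www.foo.com/index.html.</p>\n"
    "<p>There are many websites (e.g. http://www.foo.com/abc/def?a=2&b=3) "
    "that end oddly: http://www.foo.com/results</p>"
)

# Pattern from the answer: the last character may not be ' ', '<' or '.'.
ANSWER_PATTERN = r"((http:|https:)//[^ <]*[^ <.])"

# Hypothetical extension (not in the answer): also reject ')' as the
# last character, to cover the parenthesis case from the question.
EXTENDED_PATTERN = r"((http:|https:)//[^ <]*[^ <.)])"


def extract_urls(pattern, s):
    """Return the full URL of each match (hypothetical helper).

    re.findall returns one tuple per match because the pattern has two
    capture groups; the first group holds the whole URL.
    """
    return [groups[0] for groups in re.findall(pattern, s)]


print(extract_urls(ANSWER_PATTERN, text))
print(extract_urls(EXTENDED_PATTERN, text))
```

With the answer's pattern the first URL loses its trailing period, but the second still keeps the `)`. The extended character class drops that as well. Note that either pattern would also trim a period or parenthesis that genuinely belongs at the end of a URL.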

KL-7

Wow, I definitely did not think of that. Very simple solution. Thanks! – Raolin Dec 08 '11 at 20:56
