
My question is similar to this one, but more complicated.

I am trying to figure out a regex to extract URLs from a text document. The tricky part is that some of the URLs are embedded in sentences with harder-to-parse formatting. Here's an example of text I would like to extract URLs from:

<p>There are several links of the general format http://www.foo.com/index.html.</p>
<p>There are many websites (e.g. http://www.foo.com/abc/def?a=2&b=3) that end oddly: http://www.foo.com/results</p>

In these examples, the first URL is immediately followed by a sentence-ending period that needs to be excluded. The second URL is followed by a right parenthesis that also needs to be excluded, and the third ends where it hits an HTML tag.

For my purposes, a period (or a right parenthesis) is a valid URL character unless it is the very last character. In short, the problem is how to handle characters that are valid within a string only when they are not the last character of that string.

My current regex, which cannot handle this case, is (in Python):

m = re.findall(r"((http:|https:)//[^ <]+)", line)
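
To make the failure concrete, here is a small reproduction of that pattern on the two sample lines above. The name `sample` is only illustrative, and the pattern is written as a raw string with the `<` left unescaped, which should match the same text as the original.

```python
import re

# The two sample lines from above (the name `sample` is illustrative).
sample = (
    "<p>There are several links of the general format "
    "http://www.foo.com/index.html.</p>\n"
    "<p>There are many websites (e.g. http://www.foo.com/abc/def?a=2&b=3) "
    "that end oddly: http://www.foo.com/results</p>"
)

pattern = r"((http:|https:)//[^ <]+)"

# findall returns one tuple per match because the pattern has two groups;
# the first element of each tuple is the whole URL.
urls = [full for full, _scheme in re.findall(pattern, sample)]

for url in urls:
    print(url)
```

The first two results keep the trailing `.` and `)`, which is exactly the part that needs to be dropped; only the third URL comes out clean.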

Any thoughts on elegant ways to deal with this?

Raolin
    Possibly you want one of these http://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python http://stackoverflow.com/questions/499345/regular-expression-to-extract-url-from-an-html-link http://stackoverflow.com/questions/520031/whats-the-cleanest-way-to-extract-urls-from-a-string-using-python – daydreamer Dec 08 '11 at 20:03

1 Answer


You can forbid a period as the last character like this:

m = re.findall(r"((http:|https:)//[^ <]*[^ <.])", line)
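
As a rough sketch of how this behaves on the sample text from the question, the snippet below runs that pattern and then a variant that also excludes a trailing right parenthesis. The variant is an assumption about what the asker wants, not something tested beyond these samples, and the names `sample`, `no_trailing_period` and `no_trailing_punct` are illustrative.

```python
import re

# Sample text from the question (the name `sample` is illustrative).
sample = (
    "<p>There are several links of the general format "
    "http://www.foo.com/index.html.</p>\n"
    "<p>There are many websites (e.g. http://www.foo.com/abc/def?a=2&b=3) "
    "that end oddly: http://www.foo.com/results</p>"
)

# Pattern from this answer: the final character may not be a period.
period_pattern = r"((http:|https:)//[^ <]*[^ <.])"
no_trailing_period = [m[0] for m in re.findall(period_pattern, sample)]

# Assumed extension: also refuse ")" as the final character.
# With no capturing groups, findall returns plain strings.
punct_pattern = r"https?://[^ <]*[^ <.)]"
no_trailing_punct = re.findall(punct_pattern, sample)

print(no_trailing_period)
print(no_trailing_punct)
```

The first list still contains `http://www.foo.com/abc/def?a=2&b=3)` with its parenthesis, while the second drops it. The trade-off of the extended class is that a URL which legitimately ends in `)` would lose that character too.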
KL-7