My question is similar to this one, but more complicated.
I am trying to figure out a regex to extract URLs from a text document. The tricky part is that some of the URLs are embedded in sentences with hard-to-parse formatting. Here's an example of text I would like to extract URLs from:
<p>There are several links of the general format http://www.foo.com/index.html.</p>
<p>There are many websites (e.g. http://www.foo.com/abc/def?a=2&b=3) that end oddly: http://www.foo.com/results</p>
In these examples, the first URL is immediately followed by a sentence-ending period that needs to be excluded. The second URL is followed by a right parenthesis that is not part of it, and the third ends where it hits an HTML tag.
For my purposes, a period or a right parenthesis is a valid URL character unless it is the very last character. In short, the problem is how to handle characters that are valid inside a string but not as its very last character.
My current regex, which cannot handle this case, is (in Python):
m = re.findall(r"((http:|https:)//[^ <]+)", line)
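One direction that might work, shown only as a minimal sketch and not a definitive solution: let the pattern match greedily, then require one final character that is not a period or right parenthesis, so the regex engine backtracks past any trailing punctuation. The names `URL_RE` and `extract_urls` are hypothetical, and the sketch assumes a URL ends at whitespace or `<` and is at least two characters long after the `//`.

```python
import re

# Hypothetical pattern: a greedy run of non-space, non-'<' characters,
# followed by one final character that must also not be '.' or ')'.
# Backtracking drops any trailing '.' or ')' from the match.
URL_RE = re.compile(r"https?://[^\s<]+[^\s<.)]")


def extract_urls(text):
    """Return all URLs found in text, without trailing '.' or ')'."""
    # No capturing groups, so findall returns plain strings.
    return URL_RE.findall(text)


sample = (
    "<p>There are several links of the general format "
    "http://www.foo.com/index.html.</p>\n"
    "<p>There are many websites (e.g. http://www.foo.com/abc/def?a=2&b=3) "
    "that end oddly: http://www.foo.com/results</p>"
)

print(extract_urls(sample))
# ['http://www.foo.com/index.html',
#  'http://www.foo.com/abc/def?a=2&b=3',
#  'http://www.foo.com/results']
```

Periods and parentheses inside the URL are kept, but a URL that legitimately ends in `)` would lose that character under this rule.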
Any thoughts on elegant ways to deal with this?