I'm trying to extract URLs from href attributes, matching both tags that have a closing tag and tags that are self-closing or left unclosed.
Here is the regex:
<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?
Here is some sample data:
<link href='http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5' /><table><tr><td>
<a href='http://blah.net/message/new/'>Click here and submit your updated information </a> <br><br>Thanking you in advance for your attention to this matter.<br><br>
Regards, <br>
Debbi Hamilton
</td></tr><tr><td><br><br></td></tr></table>
Putting this into http://re-try.appspot.com/ or http://www.regexplanet.com/advanced/java/index.html (yes, I know it's for Java) yields precisely what I am trying to get: the tag, the href text, the enclosed text with the end tag, and the enclosed text by itself.
That said, when I use this in my Python app, the last two groups (enclosed text with the end tag, and enclosed text by itself) are always None. I suspect it has something to do with the group within a group with a back reference: ((.+?)</\1>)?
Also, I should mention that I specifically use matcher = re.compile(...) followed by matcher.findall(data), but the groups also come back as None with both matcher.search(data) and matcher.match(data).
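To make this reproducible, here is a minimal sketch of those calls. The names PATTERN and data are placeholders chosen for illustration, and data is a shortened stand-in for the sample above, not the real input from my app.

```python
import re

# The regex exactly as posted above, kept in a raw string so the
# backslashes reach the regex engine untouched.
PATTERN = r"""<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?"""

# Shortened stand-in for the sample data (assumption: the real input is longer).
data = (
    "<link href='http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5' /><table><tr><td>\n"
    "<a href='http://blah.net/message/new/'>Click here and submit your updated information </a> <br>\n"
)

matcher = re.compile(PATTERN)

# findall returns one tuple of all four groups per match.
results = matcher.findall(data)
for tag, href, enclosed_with_close, enclosed in results:
    print(repr(tag), repr(href), repr(enclosed_with_close), repr(enclosed))

# search and match each return a single match object (or None).
first = matcher.search(data)
at_start = matcher.match(data)
print(first.groups() if first else None)
print(at_start.groups() if at_start else None)
```

Printing every tuple from findall alongside the groups from search and match makes it easier to see which tag each result belongs to.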
Any help would be greatly appreciated!