
I'm trying to extract URLs from the href attributes of tags. The regex should match both tags that have a closing tag and tags that are left open or self-closed.

Here is the regex:

<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?

Here is some sample data:

<link href='http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5' /><table><tr><td>
<a href='http://blah.net/message/new/'>Click here and submit your updated information </a> <br><br>Thanking you in advance for your attention to this matter.<br><br>

Regards, <br>
Debbi Hamilton
</td></tr><tr><td><br><br></td></tr></table>

Putting this into http://re-try.appspot.com/ or http://www.regexplanet.com/advanced/java/index.html (yes, I know it's for Java) yields precisely what I am trying to get: the tag, the href text, the enclosed text with the end tag, and the enclosed text by itself.

However, when I use this in my Python app, the last two groups (enclosed text with the end tag, and enclosed text by itself) are always None. I suspect it has something to do with the group within a group with a back-reference: ((.+?)</\1>)?

I should also mention that I specifically use:
    matcher = re.compile(...)
    matcher.findall(data)

but the last two groups are also None with both matcher.search(data) and matcher.match(data).

Any help would be greatly appreciated!

Wilduck
lanthica
2 Answers

1

Respectfully, what you want to do is very silly, and you shouldn't do it.

That said, it seems to work for me (by which I mean gives non-None results):

>>> import re
>>> reg = r'<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?'
... 
>>> d = """
<link href='http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5' /><table><tr><td>
<a href='http://blah.net/message/new/'>Click here and submit your updated information </a> <br><br>Thanking you in advance for your attention to this matter.<br><br>
Regards, <br>
Debbi Hamilton
</td></tr><tr><td><br><br></td></tr></table>
"""
>>> 
>>> re.findall(reg, d)
[('link', 'http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5', '', ''), 
('a', 'http://blah.net/message/new/', 'Click here and submit your updated information </a>', 'Click here and submit your updated information ')]

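My guess is that you forgot to use a raw string when making the regular expression, so Python turned \1 into the single character '\x01' before the regex engine ever saw a back-reference, i.e.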

>>> reg = '<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?'
... 
>>> re.findall(reg, d)
[('link', 'http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5', '', ''), 
('a', 'http://blah.net/message/new/', '', '')]
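
To make the difference easier to see, here is a minimal sketch that uses a deliberately cut-down pattern (not the full regex from the question) and a made-up one-line input. The names plain, raw and html are only illustrative.

```python
import re

# Same pattern text, once as a normal string literal and once as a raw one.
plain = '<(a)>((.+?)</\1>)?'    # Python converts \1 to the character chr(1)
raw = r'<(a)>((.+?)</\1>)?'     # the backslash survives; re sees a back-reference

html = '<a>text</a>'            # made-up sample input

# The normal literal is one character shorter because \1 collapsed to chr(1).
print(len(plain), len(raw))

# With chr(1) in the pattern the optional group can never match, so it is skipped.
plain_result = re.findall(plain, html)
# With a real back-reference the optional group matches up to </a>.
raw_result = re.findall(raw, html)

print(plain_result)
print(raw_result)
```

findall reports unmatched groups as empty strings, while search(...).groups() reports them as None, which is consistent with what the question describes.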
DSM
  • Thanks, I can't believe I forgot about that. I realize trying to parse HTML via regex is horrible; that said, what I'm using this for is more of a proof of concept for extracting some URLs within tags. – lanthica Feb 12 '13 at 00:22
1
pat = ('<'
       '(\w+)\s[^<>]*?'
       'href='
       '([\'"])'
       '([\w$-_.+!*\'(\),%/:#=?~[\]&@;]*?)'
       '(?:\\2)'
       '.*?'
       '>'
       '((.+?)</\\1>)?')

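You just needed to write \\1, or to use a raw string r'...' as DSM did.

Note that I made minor modifications to your pattern:
there were two ! in the character class, and one is enough
I wrote [\] instead of \[\], because inside a character class a [ that comes after the opening [ is just a literal character
the same goes for (\) instead of \(\)

Note also that I captured the opening quote in a group, ([\'"]), and put (?:\\2) after the URL so that the closing quote has to be the same as the opening one. This makes the URL group 3 instead of group 2.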
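
As a rough sketch of how the renumbered groups come out, here is the same pattern written with raw strings instead of doubled backslashes and with only one ! in the character class. The variable names data and results are illustrative, and data is a shortened single-line version of the sample from the question.

```python
import re

pat = re.compile(
    r'<(\w+)\s[^<>]*?'                         # group 1: tag name
    r'href='
    r'''(['"])'''                              # group 2: opening quote
    r'''([\w$-_.+!*'(),%/:#=?~[\]&@;]*?)'''    # group 3: the URL
    r'\2'                                      # closing quote must equal group 2
    r'.*?>'
    r'((.+?)</\1>)?'                           # group 4: text + end tag, group 5: text
)

data = ("<link href='http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5' />"
        "<a href='http://blah.net/message/new/'>Click here </a> <br>")

# Collect tag name, URL and enclosed text for every match.
results = [(m.group(1), m.group(3), m.group(5)) for m in pat.finditer(data)]
for row in results:
    print(row)
```

With finditer, the enclosed-text group of the link tag is None because that tag has no closing counterpart, while the a tag yields its text.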

eyquem