python regular expressions: html

Question

I have an HTML file that contains a line:

a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'

When I search:

re.findall(r'href="?(\S+)"', a)

I get expected output:

['?id=11&amp;sort=&amp;indeks=0,3']

However, when I add "i" to the pattern like:

re.findall(r'href="?i(\S+)"', a)

I get:

[]

Where's the catch? Thank you in advance.

You should use a parser instead of regex. http://stackoverflow.com/a/1732454/1219006 — jamylak, May 11 '12 at 14:09
While the above link is certainly true for parsing HTML, the question is asking to find lines containing `href=?`--a task sufficiently simple for regex, IMHO. Rather, using an HTML parser here could be considered overkill. — Brian Gesiak, May 11 '12 at 14:16

score 4 · Accepted Answer · answered May 11 '12 at 14:09

The problem is that the ? has a special meaning and is not being matched literally.

To fix, change your regex like so:

re.findall(r'href="\?i(\S+)"', a)

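Otherwise, the ? is treated as the optional quantifier applied to the preceding ". This happens to work (by accident) in your first example, because the optional " matches and \S+ then captures the literal ?. It doesn't work in the second, because the character after the " is ?, not i.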
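
To see the difference concretely, here is a small sketch that runs the three patterns against the line from the question. The variable names accidental, broken and fixed are only for illustration.

```python
import re

a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'

# Unescaped ?: it makes the preceding " optional, so \S+ captures the literal ?.
accidental = re.findall(r'href="?(\S+)"', a)

# Unescaped ? followed by i: after the optional quote the regex needs an i,
# but the next character in the string is ?, so nothing matches.
broken = re.findall(r'href="?i(\S+)"', a)

# Escaped \?: matches a literal question mark, then i, then captures the rest.
fixed = re.findall(r'href="\?i(\S+)"', a)

print(accidental)  # ['?id=11&amp;sort=&amp;indeks=0,3']
print(broken)      # []
print(fixed)       # ['d=11&amp;sort=&amp;indeks=0,3']
```

Note that with the escaped pattern the captured group starts after the i, because ?i is now part of the literal prefix.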

NPE

Thank you for pointing out my mistake. However I do need to search patterns that I have already extracted from elsewhere, so it would be time consuming to change the pattern by hand. Is there a way to make python disregard "?" characters (and others with special meaning) in the pattern? – root May 11 '12 at 14:15
@priilane: Yes, by escaping the pattern first. http://docs.python.org/library/re.html#re.escape – mpen May 11 '12 at 15:18
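
A minimal sketch of the re.escape approach, assuming the text to search for arrives as a plain string. The name literal_prefix is hypothetical and stands in for a pattern extracted from elsewhere.

```python
import re

a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'

# Hypothetical literal text obtained from elsewhere; it contains the
# regex metacharacter ?.
literal_prefix = 'href="?i'

# re.escape backslash-escapes special characters so the text is matched
# literally; the capturing group is appended afterwards as real regex syntax.
pattern = re.escape(literal_prefix) + r'(\S+)"'

matches = re.findall(pattern, a)
print(matches)  # ['d=11&amp;sort=&amp;indeks=0,3']
```

Only the literal part goes through re.escape. Escaping the whole pattern would also neutralise the (\S+) group.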

score 4 · Answer 2 · answered May 11 '12 at 14:22

I personally think that Python's built-in HTMLParser is incredibly useful for cases like these. I don't think this is overkill at all -- I think it's vastly more readable and maintainable than a regex.

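>>> from html.parser import HTMLParser  # Python 3; in Python 2 the module is named HTMLParser
>>> class HrefExtractor(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         # attrs is a list of (name, value) pairs
...         if tag == 'a':
...             attrs = dict(attrs)
...             if 'href' in attrs:
...                 print(attrs['href'])
...
>>> he = HrefExtractor()
>>> he.feed('<a href=foofoofoo>')
foofoofoo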
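
As a variation, here is a sketch for Python 3 that collects the values in a list instead of printing them, applied to the line from the question. The class name HrefCollector and its hrefs attribute are made up for this example. The parser decodes character references such as &amp; in attribute values, so the result can differ from the raw text a regex would capture.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href values of <a> tags instead of printing them."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'
collector = HrefCollector()
collector.feed(a)
collector.close()
print(collector.hrefs)
```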

score 0 · Answer 3 · answered May 11 '12 at 15:13

The catch here is that ? has a special meaning in regexes: it matches zero or one occurrence of the preceding element. So, if you want the href value from the <a> tag, you should be using:

re.findall(r'href="(\?\S+)"', a)

and not

re.findall(r'href="?(\S+)"', a)

So, if you're not using ?'s special meaning, then you should escape it like \?. If you do want the special meaning, use it like ab?, which matches either a or ab. In your pattern the ? makes the preceding " optional instead of matching a literal ?.
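
A short sketch of the two uses of ?, with made-up names optional_b and literal_q: as a quantifier it makes the preceding element optional, and escaped as \? it matches a question mark itself.

```python
import re

# ab? means: an a, optionally followed by one b.
optional_b = re.compile(r'ab?')
matches_a = bool(optional_b.fullmatch('a'))    # a alone matches
matches_ab = bool(optional_b.fullmatch('ab'))  # a followed by b matches
matches_b = bool(optional_b.fullmatch('b'))    # b alone does not match

# a\? means: an a followed by a literal question mark.
literal_q = re.compile(r'a\?')
matches_a_q = bool(literal_q.fullmatch('a?'))
matches_plain_a = bool(literal_q.fullmatch('a'))

print(matches_a, matches_ab, matches_b)  # True True False
print(matches_a_q, matches_plain_a)      # True False
```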
