2

I have an HTML file that contains this line:

a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'

When I search:

re.findall(r'href="?(\S+)"', a)

I get expected output:

['?id=11&amp;sort=&amp;indeks=0,3']

However, when I add "i" to the pattern like:

re.findall(r'href="?i(\S+)"', a)

I get:

[]

Where's the catch? Thank you in advance.

root
  • 9
    You should use a parser instead of regex. http://stackoverflow.com/a/1732454/1219006 – jamylak May 11 '12 at 14:09
  • 2
    While the above link is certainly true for parsing HTML, the question is asking to find lines containing `href=?`--a task sufficiently simple for regex, IMHO. Rather, using an HTML parser here could be considered overkill. – Brian Gesiak May 11 '12 at 14:16

3 Answers

4

The problem is that the ? has a special meaning and is not being matched literally.

To fix, change your regex like so:

re.findall(r'href="\?i(\S+)"', a)

Otherwise, the ? is treated as the optional modifier applied to the ". This happens to work (by accident) in your first example, but doesn't work in the second.
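To make the difference concrete, here is a minimal sketch that runs the three patterns from this thread against the string from the question. The variable names `first`, `second`, and `fixed` are only illustrative.

```python
import re

a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'

# Unescaped: the ? makes the preceding double quote optional,
# so the literal question mark in the text ends up inside the group.
first = re.findall(r'href="?(\S+)"', a)

# Unescaped plus i: after the quote the regex requires an i,
# but the text has a question mark there, so nothing matches.
second = re.findall(r'href="?i(\S+)"', a)

# Escaped: \? matches the literal question mark, then i, then the rest.
fixed = re.findall(r'href="\?i(\S+)"', a)

print(first)   # ['?id=11&amp;sort=&amp;indeks=0,3']
print(second)  # []
print(fixed)   # ['d=11&amp;sort=&amp;indeks=0,3']
```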

NPE
  • Thank you for pointing out my mistake. However I do need to search patterns that I have already extracted from elsewhere, so it would be time consuming to change the pattern by hand. Is there a way to make python disregard "?" characters (and others with special meaning) in the pattern? – root May 11 '12 at 14:15
  • 1
    @priilane: Yes, by escaping the pattern first. http://docs.python.org/library/re.html#re.escape – mpen May 11 '12 at 15:18
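As a rough sketch of the `re.escape` suggestion: assuming the text to search for arrives as a plain string (the name `literal_prefix` is hypothetical), it can be escaped once and then combined with the regex part that should keep its special meaning.

```python
import re

a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'

# Hypothetical input: literal text obtained elsewhere that happens to
# contain a regex metacharacter (the question mark).
literal_prefix = 'href="?i'

# re.escape backslash-escapes the metacharacters so the text matches literally;
# the capture group appended afterwards is still interpreted as a regex.
pattern = re.escape(literal_prefix) + r'(\S+)"'

matches = re.findall(pattern, a)
print(matches)  # ['d=11&amp;sort=&amp;indeks=0,3']
```

Only the literal part is passed through `re.escape`. Escaping the whole pattern would also neutralise the `(\S+)` group.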
4

I personally think that Python's built-in HTMLParser is incredibly useful for cases like these. I don't think this is overkill at all -- I think it's vastly more readable and maintainable than a regex.

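>>> from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser
>>> class HrefExtractor(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         if tag == 'a':
...             attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
...             if 'href' in attrs:
...                 print(attrs['href'])
... 
>>> he = HrefExtractor()
>>> he.feed('<a href=foofoofoo>')
foofoofoo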
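For comparison with the regex results, the same idea can be applied to the line from the question. This is only a sketch for Python 3 (`html.parser`). The class name `HrefCollector` and its `hrefs` list are made up for illustration, and it collects values instead of printing them.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical variant of the extractor that stores hrefs in a list."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; the parser has
            # already decoded character references in the values.
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'
collector = HrefCollector()
collector.feed(a)
collector.close()
hrefs = collector.hrefs
print(hrefs)  # ['?id=11&sort=&indeks=0,3']
```

Note that the parser returns the attribute value with `&amp;` decoded to `&`, whereas the regex approach returns the raw text as it appears in the file.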
senderle
0

The catch here is that ? has a special meaning in regexes: it matches zero or one occurrence of the preceding element. So, if you want the href value from the <a> tag, you should use

re.findall(r'href="(\?\S+)"', a)

and not

re.findall(r'href="?(\S+)"', a)

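So, if you don't want ?'s special meaning, then you should escape it as \?. Used as a quantifier, as in ab?, it means a followed by an optional b, so the pattern matches either a or ab. In your pattern, the unescaped ? makes the preceding " optional instead of matching a literal question mark.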
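A quick way to check what `ab?` actually accepts is to test it against a few whole strings. In this small sketch the names `optional_b` and `accepted` are only illustrative.

```python
import re

# ab? means: an a, followed by zero or one b.
optional_b = re.compile(r'ab?')

# fullmatch requires the entire string to match the pattern.
accepted = {text: optional_b.fullmatch(text) is not None
            for text in ('a', 'ab', 'b', 'abb')}

print(accepted)  # {'a': True, 'ab': True, 'b': False, 'abb': False}
```

A pattern that matches either a or b would be written as a character class `[ab]` or an alternation `a|b`.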

theharshest