Hi, I have a regular expression:
<a href="(.+?)" class="nextpostslink">

This regex works fine on the following HTML:
'> <span class='pages'>Page 1 of 12</span><span class='current'>1</span><a href='http://cinemassacre.com/category/avgn/page/2/' class='page larger'>2</a><a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a><a href='http://cinemassacre.com/category/avgn/page/4/' class='page larger'>4</a><a href='http://cinemassacre.com/category/avgn/page/5/' class='page larger'>5</a><a href="http://cinemassacre.com/category/avgn/page/2/" class="nextpostslink">&raquo;</a><span class='extend'>...</span><a href='http://cinemassacre.com/category/avgn/page/12/' class='last'>Last &raquo;</a> </div> </div>

The part I am trying to extract is the next-page URL from:
<a href="http://cinemassacre.com/category/avgn/page/2/" class="nextpostslink">

But when I run this regex on this block of HTML:
'> <span class='pages'>Page 2 of 12</span><a href="http://cinemassacre.com/category/avgn/" class="previouspostslink">&laquo;</a><a href='http://cinemassacre.com/category/avgn/' class='page smaller'>1</a><span class='current'>2</span><a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a><a href='http://cinemassacre.com/category/avgn/page/4/' class='page larger'>4</a><a href='http://cinemassacre.com/category/avgn/page/5/' class='page larger'>5</a><a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">&raquo;</a><span class='extend'>...</span><a href='http://cinemassacre.com/category/avgn/page/12/' class='last'>Last &raquo;</a> </div>
</div>


It extracts everything from the first <a href=" to " class="nextpostslink">.
Why does this happen? I thought (.+?) was non-greedy, so it should match the minimal amount,
which would be just <a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">

The complete Python code I'm using is:
match=re.compile('<a href="(.+?)" class="nextpostslink">', re.DOTALL).findall(pagenav)

Martin Ender
Kr0nZ
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – phihag Dec 04 '12 at 19:47
  • Ouch... Any reason why you're not using lxml or Beautiful Soup? – Dec 04 '12 at 20:02

3 Answers


The start of your match is always greedy in a sense. That is because the engine attempts matches from left to right in your subject string. The first <a href=" is encountered, which is fine, and then the engine goes ahead and consumes everything with .+? until the match is completed (it stops as soon as possible, due to the .+?). But it does not try to start the match as far right as possible, because the match is just fine. Hence, you could say using ? makes the end of the match ungreedy (taking the first possible end of the match), but the start of the match will always be greedy (the match will always begin at the leftmost possible position, no matter how you try to make it ungreedy).
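
A minimal sketch of this behaviour, using a shortened stand-in for the page-2 HTML from the question (the `html` string and variable names are illustrative, not the asker's full input). It suggests that the lazy quantifier only shortens the end of the match, while the start stays pinned to the first `<a href="` in the string:

```python
import re

# Shortened stand-in for the page-2 navigation block from the question.
html = (
    '<a href="http://cinemassacre.com/category/avgn/" class="previouspostslink">&laquo;</a>'
    "<a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a>"
    '<a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">&raquo;</a>'
)

# Same pattern and flag as in the question.
pattern = re.compile(r'<a href="(.+?)" class="nextpostslink">', re.DOTALL)
match = pattern.search(html)
captured = match.group(1)

# The match begins at the very first <a href=" (index 0), not at the anchor
# that actually carries class="nextpostslink".
print(match.start())

# The captured group therefore swallows the previous-page anchor and
# everything after it, up to the first possible end of the match.
print(captured)
```

Because the first anchor in the page-1 HTML uses single quotes everywhere before the next-page link, the same pattern happened to work there.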

This is why there is often a better alternative to ungreedy repetition: exclude the delimiter from the repetition:

<a href="([^"]*)" class="nextpostslink">

This can never go past the closing ", so there is no need to worry that anything outside of the attribute or tag will be part of the match.
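
As a rough check, here is the negated character class applied to a shortened stand-in for the page-2 HTML (the `html` string is an illustrative sample, not the full page). Since `[^"]*` cannot cross a double quote, an attempt that starts at the previous-page anchor fails outright instead of stretching forward:

```python
import re

# Shortened stand-in for the page-2 navigation block from the question.
html = (
    '<a href="http://cinemassacre.com/category/avgn/" class="previouspostslink">&laquo;</a>'
    "<a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a>"
    '<a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">&raquo;</a>'
)

# [^"]* stops at the closing quote of the href value, so the rest of the
# pattern must match immediately after that quote or the attempt fails.
pattern = re.compile(r'<a href="([^"]*)" class="nextpostslink">')
urls = pattern.findall(html)

print(urls)  # ['http://cinemassacre.com/category/avgn/page/3/']
```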

Let me add, anyway, that you should not use regular expressions to parse HTML. What if ' is used instead of " (as in the second anchor tag of your example)? What if there are multiple spaces between your attributes? What if there are more attributes than just href and class? What if the class attribute is listed before the href attribute?

jdotjdot's answer has a good example of how to do it the right way in Python.
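
If installing an extra module is not an option, the same idea can be sketched with the standard library alone. The example below assumes Python 3 (`html.parser`); the `NextLinkFinder` class and the sample markup are hypothetical, made up to exercise the "what if" cases above (single quotes, class before href, an extra attribute, extra whitespace):

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Collects the href of every <a> whose class list contains 'nextpostslink'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs with the quotes already
        # removed, so attribute order and quote style no longer matter.
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "nextpostslink" in classes and attributes.get("href"):
            self.hrefs.append(attributes["href"])


# Hypothetical variants of the question's anchors: single quotes, class before
# href, an additional attribute, and several spaces between attributes.
html = (
    "<a href='http://cinemassacre.com/category/avgn/' class='previouspostslink'>&laquo;</a>"
    "<a   class='nextpostslink'   title='next'  href='http://cinemassacre.com/category/avgn/page/3/'>&raquo;</a>"
)

finder = NextLinkFinder()
finder.feed(html)
print(finder.hrefs)  # ['http://cinemassacre.com/category/avgn/page/3/']
```

None of the regexes in this thread would match the second anchor as written here, whereas the parser-based version only looks at the tag name and its attributes.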

Martin Ender
  • Ahh I thought it might be something like that, but it worked fine the first time around (on the first block), so I thought that couldn't be the reason. Thanks [^"] works great. – Kr0nZ Dec 04 '12 at 19:53

As I understand it, the greediness works from the beginning of the regex--i.e., it finds <a href=", and then the non-greediness has it stop at the first " class="nextpostslink"> instead of the last one, like the greedy version would do.

You're best off using BeautifulSoup here:

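from bs4 import BeautifulSoup as BS

# html holds the page source (e.g. the pagenav string from the question)
soup = BS(html, "html.parser")  # name the parser explicitly to avoid a bs4 warning
# a string as the second argument to find() matches on CSS class
print(soup.find("a", "nextpostslink")["href"])
# prints http://cinemassacre.com/category/avgn/page/2/ for the first HTML block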
jdotjdot
  • Thanks, I'll give BeautifulSoup a shot. My aim is to make this a plugin for XBMC, so I was trying to avoid installing any additional modules. But I now see XBMC does include a BeautifulSoup and a ParseDOM module, so I'll give one of those a try. – Kr0nZ Dec 04 '12 at 20:07
  • `lxml` is way faster, so consider that also. – jdotjdot Dec 04 '12 at 20:08

It extracts everything from the first <a href=" to " class="nextpostslink">. Why does this happen? I thought (.+?) was non greedy, so it should extract the minimal amount

It is non-greedy. However, the literal " class="nextpostslink"> that must follow the group forces the engine to keep extending the match, starting from the first <a href=" it finds, until that literal appears.

NPE