I am trying to extract the link http://xyz.com/5 from the string below. Only that link has the class="next" attribute, so I am trying to match on that attribute.

<a href='http://xyz.com/1' class='page larger'>2</a>
<a href='http://xyz.com/2' class='page larger'>3</a>
<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>

I tried the pattern below, but it returns all the links in the entire text.

<a href='(.+?)' class="next">

(I understand from this site that using regular expressions to parse HTML is a bad idea, but I have to do this for now.)

shibin
  • Why do you "need to"? – TerryA Jun 30 '13 at 02:22
  • I agree that you shouldn't use regex to parse HTML. However, your regex worked for me (in multi-line mode). Depending on how you are running this you may have to escape the < > signs. – Andy G Jun 30 '13 at 02:25
  • @AndyG I noticed this too. http://regexr.com?35dan – TerryA Jun 30 '13 at 02:26
  • Well, if I have to use the DOM, I would need to rewrite a lot of the program. If I can get this pattern, I can finish this in one line. This is for an XBMC video plugin, in case anyone would like to know. – shibin Jun 30 '13 at 02:27
  • @shibin When you say that the regex returns all the links, what did you try that produced this? – TerryA Jun 30 '13 at 02:28
  • ok, this is actually a part of a complete HTML page. I used below Python code which returns the whole HTML. match=re.compile(' – shibin Jun 30 '13 at 02:31

2 Answers

Please don't use regex to parse HTML. Use something like BeautifulSoup. It's so much easier and better :p

from bs4 import BeautifulSoup as BS

html = """<a href='http://xyz.com/1' class='page larger'>2</a>
<a href='http://xyz.com/2' class='page larger'>3</a>
<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>"""

# Name the parser explicitly so bs4 does not have to guess one.
soup = BS(html, 'html.parser')

# Select only the <a> tags whose class is "next" and print their href.
for atag in soup.find_all('a', {'class': 'next'}):
    print(atag['href'])

With your example, this prints:

http://xyz.com/5

Also, your regular expression works fine.
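
A minimal sketch of why the pattern appears to work on the sample as posted, assuming the input really does contain one link per line. Without the re.DOTALL flag, "." does not match a newline, so a match cannot start on one line and end on another. The variable names here are only illustrative.

```python
import re

# The sample exactly as posted in the question, one link per line.
html_multiline = """<a href='http://xyz.com/1' class='page larger'>2</a>
<a href='http://xyz.com/2' class='page larger'>3</a>
<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>"""

# The pattern from the question, unchanged.
pattern = r"""<a href='(.+?)' class="next">"""

# "." stops at each newline, so only the last line can satisfy the pattern.
links = re.findall(pattern, html_multiline)
print(links)  # ['http://xyz.com/5']
```

If the real input has no line breaks, this no longer holds, because nothing stops the wildcard from running across several links.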

TerryA
  • Thanks for your answer; I appreciate it. Actually, there are no newline characters: the entire string is a single line. I only added the line breaks for readability. Because the sample I posted has newline characters, it worked in the link you provided ("works fine"). :) Also, I think that if I use my pattern in Python, it will still return everything even with newline characters, as I am using "findall". – shibin Jun 30 '13 at 02:45

Try this regexp:

<a href='([^']+)' class="next">

Making a regular expression non-greedy doesn't mean it will always find the shortest match. It just means that once it has found a match it will return it; it won't keep looking for a longer one. Put another way, it will use the shortest match at the right-hand end of the wildcard, but not at the left-hand end.

So your regular expression was matching at the beginning of the first link, and continuing until it found class="next". Instead of using .+?, using [^']+ means that the wildcard will not cross attribute boundaries, so you're assured of matching just one link.
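
Here is a small sketch comparing the two patterns, assuming the input is the question's sample joined into a single line (as the asker describes their real data). The names html, lazy and negated are only for illustration.

```python
import re

# Assumed input: the question's sample with the line breaks removed.
html = (
    "<a href='http://xyz.com/1' class='page larger'>2</a>"
    "<a href='http://xyz.com/2' class='page larger'>3</a>"
    "<a href='http://xyz.com/3' class='page larger'>4</a>"
    "<a href='http://xyz.com/4' class='page larger'>5</a>"
    "<a href='http://xyz.com/5' class=\"next\">»</a>"
)

# Original pattern: the match starts at the FIRST link and the lazy
# wildcard grows until class="next" finally follows it.
lazy = re.findall(r"""<a href='(.+?)' class="next">""", html)

# Suggested pattern: [^']+ cannot cross the closing quote of href,
# so the capture is confined to a single attribute value.
negated = re.findall(r"""<a href='([^']+)' class="next">""", html)

print(len(lazy), lazy[0][:24])  # one long match starting at link 1
print(negated)                  # ['http://xyz.com/5']
```

The lazy version returns a single string that runs from the first URL to the last, which is why it looks like it "returns all links".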

Barmar
  • Thanks Barmar. This is exactly what I am looking for and completely answers my question. Thank you!! – shibin Jun 30 '13 at 02:32