Regular expression to extract a specific value from HTML anchors

Question

I am trying to extract the http://xyz.com/5 link from the string below. Only that anchor has the class="next" attribute, so I am trying to match on that attribute.

<a href='http://xyz.com/1' class='page larger'>2</a>
<a href='http://xyz.com/2' class='page larger'>3</a>
<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>

I tried the pattern below, but it returns all the links in the entire text.

<a href='(.+?)' class="next">

(I understand from this site that using regular expressions to parse HTML is a bad idea, but I have to do this for now.)

I agree that you shouldn't use regex to parse HTML. However, your regex worked for me (in multi-line mode). Depending on how you are running this you may have to escape the < > signs. — Andy G, Jun 30 '13 at 02:25
Well, if I have to use the DOM, I need to rewrite a lot of the program. If I can get this pattern, I can finish this in one line. This is for an XBMC video plugin, in case anyone would like to know. — shibin, Jun 30 '13 at 02:27
@shibin When you say that the regex returns all the links, what did you try that produced this? — TerryA, Jun 30 '13 at 02:28
OK, this is actually part of a complete HTML page. I used the Python code below, which returns the whole HTML: match=re.compile('').findall(link) — shibin, Jun 30 '13 at 02:31

score 2 · Answer 1 · edited May 23 '17 at 11:50


Please don't use regex to parse HTML. Use something like BeautifulSoup. It's so much easier and better :p

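from bs4 import BeautifulSoup as BS

html = """<a href='http://xyz.com/1' class='page larger'>2</a>
<a href='http://xyz.com/2' class='page larger'>3</a>
<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BS(html, 'html.parser')

# Select only the <a> tags whose class is "next", then read their href attribute.
for atag in soup.find_all('a', {'class': 'next'}):
    print(atag['href'])  # print(...) with one argument works on Python 2 and 3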

With your example, this prints:

http://xyz.com/5

Also, your regular expression works fine.
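
If adding BeautifulSoup as a dependency is not an option (for example inside a plugin), the same attribute-based lookup can be sketched with the standard library's html.parser module. This is a minimal sketch, not a drop-in replacement: the class name NextLinkFinder and the hrefs attribute are hypothetical names chosen for illustration, and the input is assumed to be the one-line sample from the question.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Collects the href of every <a> tag whose class list contains 'next'.

    NextLinkFinder is a hypothetical helper name, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "next" in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


# The sample from the question, joined into one line as the asker describes.
html = (
    "<a href='http://xyz.com/1' class='page larger'>2</a>"
    "<a href='http://xyz.com/2' class='page larger'>3</a>"
    "<a href='http://xyz.com/3' class='page larger'>4</a>"
    "<a href='http://xyz.com/4' class='page larger'>5</a>"
    "<a href='http://xyz.com/5' class=\"next\">\u00bb</a>"
)

finder = NextLinkFinder()
finder.feed(html)
finder.close()
print(finder.hrefs)
```

Because the parser works on attributes rather than on raw text, it should not matter whether the attributes use single or double quotes or in which order they appear.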

answered Jun 30 '13 at 02:21 by TerryA · edited May 23 '17 at 11:50 by Community

Thanks for your answer; I appreciate it. Actually, there are no newline characters: the entire string is a single line. I only added the line breaks for readability. Because the sample I gave contains newlines, it worked in the link you provided ("works fine"). :) Also, I think that if I use my pattern in Python, it will still return everything even with newline characters, as I am using "findall". – shibin Jun 30 '13 at 02:45

score 2 · Accepted Answer · answered Jun 30 '13 at 02:27

Try this regexp:

<a href='([^']+)' class="next">

Making a quantifier non-greedy doesn't mean the regular expression will always find the shortest match. It only means that, from the position where the match starts, the engine stops at the first point where the rest of the pattern matches instead of looking for a longer match. Put another way, it uses the shortest match at the right-hand end of the wildcard, but not at the left-hand end: the engine still starts at the leftmost position where a match is possible.

So your regular expression matched at the beginning of the first link and continued until it found class="next". Using [^']+ instead of .+? means the wildcard cannot cross attribute boundaries, so you're assured of matching just one link.
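
The difference can be checked with a short script. This is a sketch under the assumption, taken from the asker's comment, that the real input is one long line; the variable names (one_line, multi_line, lazy, bounded) are made up for illustration. It also runs the original pattern on the multi-line version of the sample, where "." cannot cross a newline by default, which would explain why the pattern appeared to work for the commenters.

```python
import re

# The five anchors from the question; only the last one has class="next".
anchors = [
    "<a href='http://xyz.com/1' class='page larger'>2</a>",
    "<a href='http://xyz.com/2' class='page larger'>3</a>",
    "<a href='http://xyz.com/3' class='page larger'>4</a>",
    "<a href='http://xyz.com/4' class='page larger'>5</a>",
    "<a href='http://xyz.com/5' class=\"next\">\u00bb</a>",
]
one_line = "".join(anchors)      # no newlines, as in the real page
multi_line = "\n".join(anchors)  # as the sample was pasted in the question

lazy = re.compile(r"""<a href='(.+?)' class="next">""")       # original pattern
bounded = re.compile(r"""<a href='([^']+)' class="next">""")  # suggested pattern

# On one line, the lazy match starts at the FIRST anchor and grows until
# ' class="next"> finally follows, so the group swallows all five links.
lazy_hits = lazy.findall(one_line)
print(len(lazy_hits), lazy_hits[0][:40])

# [^']+ cannot run past the closing quote of href, so only the last anchor fits.
bounded_hits = bounded.findall(one_line)
print(bounded_hits)

# With newlines (and without re.DOTALL), "." stops at each line end,
# so even the lazy pattern can only succeed on the last line.
lazy_multi = lazy.findall(multi_line)
print(lazy_multi)
```

On the one-line input the original pattern yields a single oversized capture rather than five separate links, which is what "returns all links" looks like in practice.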

Thanks, Barmar. This is exactly what I was looking for, and it completely answers my question. Thank you!! — shibin, Jun 30 '13 at 02:32
