How to extract the URL from this HTML tag?

Question

I'm trying to get all URLs with id='revSAR' from the HTML tag below, using a Python regex:

<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>

I tried the code below, but it's not working (it prints nothing):

regex = b'<a id="revSAR" href="(.+?)" class="txtsmall noTextDecoration">(.+?)</a>'
pattern=re.compile(regex)
rev_url=re.findall(pattern,txt)
print ('reviews url: ' + str(rev_url))

Example of parsing `a` links with Beautiful Soup: https://groups.google.com/forum/?fromgroups#!topic/beautifulsoup/8TbctreqvSI — Paul, Aug 20 '13 at 06:57
Or http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup — Paul, Aug 20 '13 at 07:00

score 1 · Answer 1 · answered Aug 20 '13 at 05:55

You could try something like

import re
# html holds the tag text; the unpacking assumes exactly one href match
(_, url), = re.findall(r'href=([\'"]*)(\S+)\1', html)
print(url)

However, personally I'd rather use an HTML parsing library like BeautifulSoup for a task like this.
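
To illustrate the parser-based approach without requiring an install, here is a minimal sketch that swaps in the standard library's html.parser instead of BeautifulSoup. The class name RevSarLinks and the variable names are made up for this example, and the input is the tag from the question.

```python
from html.parser import HTMLParser

URL = ("http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/"
       "B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending")

# The anchor tag from the question, rebuilt as a Python string.
html = ("<a id='revSAR' href='" + URL + "' class='txtsmall noTextDecoration'>\n"
        "  See all 136 customer reviews\n"
        "</a>")


class RevSarLinks(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags with id='revSAR'."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed
        if tag != "a":
            return
        attr_map = dict(attrs)
        if attr_map.get("id") == "revSAR" and attr_map.get("href"):
            self.urls.append(attr_map["href"])


parser = RevSarLinks()
parser.feed(html)
print("reviews url:", parser.urls)
```

Because the parser handles the quoting itself, the same code should work whether the attributes use single or double quotes. With BeautifulSoup installed, the equivalent lookup would be roughly soup.find('a', id='revSAR')['href'].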

answered Aug 20 '13 at 05:55

Marc Liyanage

Will BeautifulSoup work on Windows? How do I install it and set it up under Python 3.3? – Vijay Kumar Aug 20 '13 at 06:01
I'm not on Windows so I've never done it, but this post seems to have tips for installing BeautifulSoup on Windows: [How to install beautiful soup 4 with python 2.7 on windows](http://stackoverflow.com/questions/12228102/how-to-install-beautiful-soup-4-with-python-2-7-on-windows) – Marc Liyanage Aug 20 '13 at 06:28

score 0 · Answer 2 · answered Aug 20 '13 at 05:50

You don't need to match the unnecessary parts like id=... and href=.... Try this:

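# Raw string so \s reaches the regex engine; the match includes the closing quote and trailing whitespace
regex = r"http://.*'\s+"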

answered Aug 20 '13 at 05:50

WoooHaaaa

As there are several URLs on the Amazon product reviews page, I would like to extract only the URL from the tag with this id – Vijay Kumar Aug 20 '13 at 06:00

score 0 · Answer 3 · edited May 23 '17 at 11:56

First, why didn't your regex work? In your HTML the attributes are quoted with single quotes, whereas your regex uses double quotes. You also only need to care about the href attribute. Try something like href=['"](.+?)['"] as the regex; it would also be better to use the ignore-case flag.

But again, it's a very bad idea to parse HTML using regex. Please go through this
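
The quote mismatch described above can be reproduced with a short sketch. It assumes the page text is a str rather than bytes, and it extends the suggested pattern with an id check so that only the revSAR link is returned (this version assumes id appears before href in the tag). The variable names are illustrative only.

```python
import re

html = (
    "<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/"
    "product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary"
    "?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' "
    "class='txtsmall noTextDecoration'>\n  See all 136 customer reviews\n</a>"
)

# The question's pattern (as a str pattern): it hard-codes double quotes,
# so it cannot match a tag whose attributes use single quotes.
original = re.compile(
    r'<a id="revSAR" href="(.+?)" class="txtsmall noTextDecoration">(.+?)</a>'
)
original_matches = original.findall(html)
print(original_matches)

# Accept either quote style, ignore case, and require id='revSAR' before href.
revsar_href = re.compile(
    r"""<a\s[^>]*?id=['"]revSAR['"][^>]*?href=['"](.+?)['"]""",
    re.IGNORECASE,
)
urls = revsar_href.findall(html)
print("reviews url:", urls)
```

This is still a pattern match on text, so it inherits the usual fragility of regex-on-HTML: attribute order, unquoted values, or quotes inside other attributes can all defeat it.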

answered Aug 20 '13 at 06:02

Jithin

score 0 · Answer 4 · answered Aug 20 '13 at 14:30

Description

This expression will:

find anchor tags
require the anchor tag to have an id attribute with the value revSAR
capture the href attribute value, not including any surrounding quotes
capture the inner text, trimmed of surrounding whitespace
allow the attributes to appear in any order
allow attribute values to be double-quoted, single-quoted, or unquoted
avoid many of the edge cases that frequently trip up regular expressions when matching HTML

<a(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)revSAR\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=(['"]?)(.*?)\2(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(.*?)\s*<\/a>

Examples

Sample Text

Note that the first two anchor tags here contain some really difficult edge cases.

<a onmouseover=' id="revSAR" ; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; '  href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  You shouldn't find me
</a>



<a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 111 customer reviews
</a>


<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>

Matches

Group 0 gets the entire anchor tag
Group 1 gets the quote surrounding the id attribute which is used later to find the correct closing quote
Group 2 gets the quote surrounding the href attribute which is used later to find the correct closing quote
Group 3 gets the href attribute value, not including any quotes
Group 4 gets the inner text, not including any surrounding whitespace

[0][0] = <a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 111 customer reviews
</a>
[0][1] = '
[0][2] = '
[0][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending
[0][4] = See all 111 customer reviews


[1][0] = <a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>
[1][1] = '
[1][2] = '
[1][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending
[1][4] = See all 136 customer reviews
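
As a rough usage sketch (not part of the original answer), the expression can be run with Python's re module against the sample text. The names PATTERN, SAMPLE, URL and matches are made up for this example. The expression is written as one continuous pattern, on the assumption that any whitespace between the two lookaheads is a wrapping artifact, and it is split across string literals only for readability.

```python
import re

# The answer's expression; adjacent raw strings are concatenated by Python.
PATTERN = re.compile(
    r"""<a(?=\s|>)"""
    r"""(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)revSAR\1(?:\s|>))"""
    r"""(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=(['"]?)(.*?)\2(?:\s|>))"""
    r"""(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(.*?)\s*<\/a>"""
)

URL = ("http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/"
       "B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending")

# The three sample tags; {url} is a placeholder filled in below.
SAMPLE = """<a onmouseover=' id="revSAR" ; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; '  href='{url}' class='txtsmall noTextDecoration'>
  You shouldn't find me
</a>
<a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='{url}' class='txtsmall noTextDecoration'>
  See all 111 customer reviews
</a>
<a id='revSAR' href='{url}' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>
""".replace("{url}", URL)

# Group 3 is the href value, group 4 is the trimmed inner text.
matches = [(m.group(3), m.group(4)) for m in PATTERN.finditer(SAMPLE)]
for url, text in matches:
    print(text, "->", url)
```

The first tag is skipped because its only id="revSAR" sits inside the quoted onmouseover value, which the lookahead consumes as a single attribute value.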
