0

I'm trying to get all URLs with id='revSAR' from the HTML tag below, using a Python regex:

<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>

I tried the code below, but it's not working (it prints nothing):

regex = b'<a id="revSAR" href="(.+?)" class="txtsmall noTextDecoration">(.+?)</a>'
pattern=re.compile(regex)
rev_url=re.findall(pattern,txt)
print ('reviews url: ' + str(rev_url))
Priya Ranjan Singh
  • Example of parsing `a` links with Beautiful Soup: https://groups.google.com/forum/?fromgroups#!topic/beautifulsoup/8TbctreqvSI – Paul Aug 20 '13 at 06:57
  • Or http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – Paul Aug 20 '13 at 07:00

4 Answers

1

You could try something like

import re
# `input` holds the HTML text; the unpacking assumes exactly one href match
(_, url), = re.findall(r'href=([\'"]*)(\S+)\1', input)
print(url)

However, personally I'd rather use an HTML parsing library like BeautifulSoup for a task like this.
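As a rough illustration of the parser-based approach, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so nothing has to be installed. The class name `RevSarLinks` and the shortened `example.com` URLs are assumptions made for the example, not part of the question.

```python
from html.parser import HTMLParser


class RevSarLinks(HTMLParser):
    """Collect href values of <a> tags whose id attribute is 'revSAR'."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quote style no longer matters
        if tag != "a":
            return
        attr_map = dict(attrs)
        if attr_map.get("id") == "revSAR" and "href" in attr_map:
            self.urls.append(attr_map["href"])


# Stand-in for the downloaded page; the URLs are shortened placeholders.
html = (
    "<a id='revSAR' href='http://example.com/product-reviews/B000EDKP8U?ie=UTF8' "
    "class='txtsmall noTextDecoration'>See all 136 customer reviews</a>"
    "<a href='http://example.com/other'>other link</a>"
)

parser = RevSarLinks()
parser.feed(html)
print(parser.urls)
```

Because the parser hands over attributes as name/value pairs, the single-versus-double quote problem and the attribute order stop being a concern.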

Marc Liyanage
  • Will BeautifulSoup work on Windows? How do I install and set it up under Python 3.3? – Vijay Kumar Aug 20 '13 at 06:01
  • I'm not on Windows so I've never done it, but this post seems to have tips for installing BeautifulSoup on Windows: [How to install beautiful soup 4 with python 2.7 on windows](http://stackoverflow.com/questions/12228102/how-to-install-beautiful-soup-4-with-python-2-7-on-windows) – Marc Liyanage Aug 20 '13 at 06:28
0

You don't need to match unnecessary parts like id=... and href=.... Try this:

regex = r'http://.*\'\s+'

WoooHaaaa
  • As there are several URLs on the Amazon product reviews page, I would like to extract only the URL for the tag with this id – Vijay Kumar Aug 20 '13 at 06:00
0

First, why didn't your regex work? In your HTML the attributes are quoted with single quotes, whereas your regex expects double quotes. You also only need to care about the href attribute. Try something like href=['"](.+?)['"] as the regex; it would also be better to use the ignore-case switch.

But again, it's a very bad decision to parse HTML using regex. Please go through this
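To make the quote mismatch concrete, here is a small sketch of the question's pattern adjusted to accept either quote style. It assumes the page has already been decoded to `str` (with `bytes` input the pattern would need the `b` prefix, as in the question), that `id` comes directly before `href`, and it uses shortened placeholder URLs.

```python
import re

# Stand-in for the page source; URLs are shortened placeholders.
html = """<a id='revSAR' href='http://example.com/product-reviews/B000EDKP8U' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>
<a id="other" href="http://example.com/elsewhere">Another link</a>"""

# ['"] accepts single or double quotes around each attribute value.
# The id check is kept so that only the revSAR anchor is picked up.
pattern = re.compile(r"""<a\s+id=['"]revSAR['"]\s+href=['"](.+?)['"]""", re.IGNORECASE)

rev_urls = pattern.findall(html)
print("reviews url:", rev_urls)
```

This is still fragile: it breaks as soon as the attributes appear in a different order or another attribute sits between `id` and `href`.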

Jithin
0

Description

This expression will:

  • find anchor tags
  • require the anchor tag to have an id attribute with the value revSAR
  • capture the href attribute value, not including any surrounding quotes if they exist
  • capture the inner text, trimming the surrounding whitespace
  • allow the attributes to appear in any order
  • allow attribute values to be double quoted, single quoted, or unquoted
  • avoid many of the edge cases which frequently trip up regular expressions when pattern matching HTML

<a(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)revSAR\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=(['"]?)(.*?)\2(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(.*?)\s*<\/a>

Examples

Sample Text

Note that the first couple of anchor tags here contain some really difficult edge cases.

<a onmouseover=' id="revSAR" ; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; '  href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  You shouldn't find me
</a>



<a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 111 customer reviews
</a>


<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>

Matches

Group 0 gets the entire anchor tag
Group 1 gets the quote surrounding the id attribute which is used later to find the correct closing quote
Group 2 gets the quote surrounding the href attribute which is used later to find the correct closing quote
Group 3 gets the href attribute value, not including any quotes
Group 4 gets the inner text, not including any surrounding whitespace

[0][0] = <a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 111 customer reviews
</a>
[0][1] = '
[0][2] = '
[0][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending
[0][4] = See all 111 customer reviews


[1][0] = <a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>
[1][1] = '
[1][2] = '
[1][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending
[1][4] = See all 136 customer reviews
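
For anyone wanting to try the expression from Python, the following is a minimal sketch. The pattern is assembled from pieces only for readability, with the two lookaheads adjacent (no literal space between them); the names `SKIP_ATTRS` and `PATTERN`, and the shortened `example.com` URLs, are assumptions for the example. Groups 3 and 4 carry the URL and the inner text, as listed above.

```python
import re

# Skips over attributes (quoted or unquoted) without leaving the opening tag.
SKIP_ATTRS = r"""(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?"""

PATTERN = re.compile(
    r"<a(?=\s|>)"
    # lookahead 1: the tag must carry id=revSAR (group 1 = its quote)
    r"(?=" + SKIP_ATTRS + r"""\sid=(['"]?)revSAR\1(?:\s|>))"""
    # lookahead 2: capture href (group 2 = its quote, group 3 = the URL)
    r"(?=" + SKIP_ATTRS + r"""\shref=(['"]?)(.*?)\2(?:\s|>))"""
    # consume the rest of the opening tag
    r"""(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>"""
    # group 4: inner text without surrounding whitespace
    r"\s*(.*?)\s*<\/a>"
)

# Stand-in page: href deliberately placed before id, plus a non-matching anchor.
html = """
<a href='http://example.com/reviews/B000EDKP8U' id='revSAR' class='txtsmall'>
  See all 136 customer reviews
</a>
<a id='other' href='http://example.com/elsewhere'>Not this one</a>
"""

results = [(m.group(3), m.group(4)) for m in PATTERN.finditer(html)]
for url, text in results:
    print(url, "|", text)
```

The inner-text group relies on `.` not matching newlines, so this sketch only handles link text that sits on a single line.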
Ro Yo Mi