I'm new to web scraping and have little exposure to HTML document structure, and I wanted to know whether there is a better, more efficient way to search for the content I need in the HTML of a web page. Currently, I want to scrape the reviews for this product: http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem
For this, I have the following code:
import re
import sys
import urllib2

from bs4 import BeautifulSoup

review_url = ('http://www.walmart.com/ip/29701960?wmlspartner=wlpa'
              '&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061'
              '&wl4=&wl5=pla&wl6=62272156621&veh=sem')

#-------------------------------------------------------------------------
# Scrape the ratings
#-------------------------------------------------------------------------
page_no = 1
sum_total_reviews = 0
more = True
while more:
    # Open the URL to get the review data
    request = urllib2.Request(review_url)
    try:
        page = urllib2.urlopen(request)
    except urllib2.URLError, e:
        # HTTPError instances have both 'code' and 'reason', so test
        # the status code first
        if hasattr(e, 'code'):
            print 'Error: ', e.code
            sys.exit()
        elif hasattr(e, 'reason'):
            print 'Failed to reach url'
            print 'Reason: ', e.reason
            sys.exit()
    content = page.read()
    soup = BeautifulSoup(content)
    # NOTE: nothing in the loop yet advances page_no or sets more = False
    results = soup.find_all('span', {'class': re.compile(r's_star_\d_0')})
With this, `results` comes back empty. I'm guessing I have to point the search at a more precise location in the page. Any suggestions?
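One way to rule out the selector itself as the problem is to test the class-name regex in isolation, away from the page fetch. The sketch below checks `r's_star_\d_0'` against some made-up class strings (the sample names are assumptions, not taken from Walmart's actual markup); BeautifulSoup's `find_all` applies such a compiled pattern with `search`, so a plain `re` test mirrors its matching behaviour:

```python
import re

# Same pattern as in the scraper: intended to match classes like "s_star_4_0"
star_class = re.compile(r's_star_\d_0')

# Hypothetical class names, only for exercising the pattern
samples = ['s_star_4_0', 's_star_4_5', 'star_4_0']
matches = [s for s in samples if star_class.search(s)]
print(matches)
```

If the pattern behaves as expected here but `find_all` still returns nothing, the class names in the fetched HTML differ from what the regex expects, or the review markup is not present in the raw response at all.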