
I'm new to web scraping and have little exposure to HTML page structure, and I wanted to know if there is a better, more efficient way to search for the required content in the HTML of a web page. Currently, I want to scrape reviews for a product here: http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem

For this, I have the following code:

import re
import sys
import urllib2

from bs4 import BeautifulSoup

url = ('http://www.walmart.com/ip/29701960?wmlspartner=wlpa'
       '&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061'
       '&wl4=&wl5=pla&wl6=62272156621&veh=sem')
review_url = url

#-------------------------------------------------------------------------
# Scrape the ratings
#-------------------------------------------------------------------------
page_no = 1
sum_total_reviews = 0
more = True

while more:
    # Open the URL to get the review data
    request = urllib2.Request(review_url)
    try:
        page = urllib2.urlopen(request)
    except urllib2.URLError, e:
        if hasattr(e, 'reason'):
            print 'Failed to reach url'
            print 'Reason: ', e.reason
            sys.exit()
        elif hasattr(e, 'code'):
            if e.code == 404:
                print 'Error: ', e.code
                sys.exit()

    content = page.read()
    soup = BeautifulSoup(content)
    results = soup.find_all('span', {'class': re.compile(r's_star_\d_0')})

With this, I'm not able to read anything. I'm guessing I have to point it at a more accurate destination. Any suggestions?

Aks

2 Answers


The reviews are loaded via an AJAX call, so you cannot find them at the link you provided. They are loaded from the following URL:

http://walmart.ugc.bazaarvoice.com/1336/29701960/reviews.djs?format=embeddedhtml&dir=desc&sort=relevancy

Here, 29701960 can be found in the HTML source of the page you are fetching, like this:

<meta property="og:url" content="http://www.walmart.com/ip/29701960" />
                                                           +------+ this one

or

trackProductId : '29701960',
                  +------+ or this one

And 1336 is from the source:

WALMART.BV.scriptPath =  'http://walmart.ugc.bazaarvoice.com/static/1336/';
                                                                    +--+ here

Using these values, build the above URL and parse the data from there using BeautifulSoup, as in the sketch below.
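For illustration, here is a minimal sketch of that approach, assuming the two values can be pulled out of the product page source with regular expressions (the exact patterns are my assumption, modeled on the snippets quoted above, not a tested contract):

import re
import urllib2

from bs4 import BeautifulSoup

product_url = 'http://www.walmart.com/ip/29701960'
source = urllib2.urlopen(product_url).read()

# Both patterns are assumptions based on the snippets shown above
product_id = re.search(r"trackProductId\s*:\s*'(\d+)'", source).group(1)
static_id = re.search(r"bazaarvoice\.com/static/(\d+)/", source).group(1)

reviews_url = ('http://walmart.ugc.bazaarvoice.com/%s/%s/reviews.djs'
               '?format=embeddedhtml&dir=desc&sort=relevancy'
               % (static_id, product_id))

# The .djs response is JavaScript with HTML embedded in string literals,
# so some unescaping may be needed before BeautifulSoup can parse it
reviews_source = urllib2.urlopen(reviews_url).read()
soup = BeautifulSoup(reviews_source)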

Sabuj Hassan
  • @Sabuj, Thanks for your feedback, looks like that link shows the reviews, but I agree with Alecxe's comment as well. I merely wanted to look for a way to get a correct destination that I can use in my results statement. results = soup.find_all('span', {'class': re.compile(r's_star_\d_0')}) – Aks Mar 23 '14 at 19:58
  • Which means this hasn't solved the issue. Any suggestions would be great – Aks Mar 23 '14 at 21:03
  • @user2690931 IMO, your approach seemed wrong to me. That's why I have suggested my solution to you. If you cannot follow that, then you have to wait for others to enlighten you! – Sabuj Hassan Mar 23 '14 at 21:10
  • @user2690931: In general, you could [use selenium to download a page generated using javascript](http://stackoverflow.com/q/8960288/4279). Though in this case, it might be simpler to use @Sabuj Hassan's suggestion. If you don't know how to extract `trackProductId` number (`29701960`) or the number from `scriptPath` (`1336`), [update your question](http://stackoverflow.com/posts/22595693/edit) or [ask a new one](http://stackoverflow.com/questions/ask) – jfs Mar 23 '14 at 22:56

I understand that the question was initially about BeautifulSoup, but since you haven't had any success using it in this particular situation, I suggest taking a look at selenium.

Selenium drives a real browser, so you don't have to deal with parsing the results of AJAX calls. For example, here's how you can get the list of review titles and ratings from the first page of reviews:

from selenium.webdriver.firefox import webdriver


driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

# Each review sits in a Bazaarvoice container element
for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    # The star rating is exposed via the rating image's title attribute
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating

driver.close()

prints:

Renee Culver loves Clorox Wipes 5 out of 5
Men at work 5 out of 5
clorox wipes 5 out of 5
...

Also, take into account that you can use a headless PhantomJS browser (example).
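A minimal sketch of the PhantomJS variant, assuming the phantomjs binary is installed and on your PATH (everything after constructing the driver is the same as in the Firefox example above):

from selenium import webdriver

# PhantomJS runs without opening a browser window; the phantomjs
# executable must be on PATH (or pass executable_path=...)
driver = webdriver.PhantomJS()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')
# ... same scraping code as in the Firefox example ...
driver.quit()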


Another option is to make use of the Walmart API.
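If you go the API route, the sketch below shows the general shape; note that the endpoint path, the apiKey parameter, and the response field names are all my assumptions about the Walmart Open API (verify them against its documentation), and YOUR_API_KEY is a placeholder:

import json
import urllib2

# Assumed endpoint and fields; check the Walmart Open API docs
api_url = ('http://api.walmartlabs.com/v1/reviews/29701960'
           '?format=json&apiKey=YOUR_API_KEY')
data = json.loads(urllib2.urlopen(api_url).read())

for review in data.get('reviews', []):
    print review.get('title'), review.get('overallRating', {}).get('rating')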

Hope that helps.

alecxe
  • Thanks for this @alecxe. I was able to pull the review titles, review text, date & location, which I wanted. Since I'm new to selenium, could you let me know what I have to do in order to pull the 'ratings' as well ? It seems to be wrapped inside additional layers which I can't seem to comprehend as I'm new to html file systems. Appreciate it. Thanks. – Aks Mar 30 '14 at 00:37
  • I would like to disturb you over one final issue. Is there a way we can recursively scrape all the page numbers. It's currently just scraping the content it can "see" on the first page. You may have noticed that the url remains constant. Is there a way ? Do help. Thanks. – Aks Apr 05 '14 at 08:28
  • @Anuj yeah, you need to get page numbers first, then parse reviews page by page. Consider creating a separate question if you need help. Thanks. – alecxe Apr 06 '14 at 02:13
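Following up on the pagination comments above, here is a rough sketch of the page-by-page idea with selenium, assuming the reviews are paginated by a Bazaarvoice "next page" link (the .BVRRNextPage class name is an assumption to verify against the live page):

from selenium.common.exceptions import NoSuchElementException

# ... after driver.get(...) as in the answer above ...
while True:
    for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
        print review.find_element_by_class_name('BVRRReviewTitle').text
    try:
        # Assumed pagination selector; a short wait after click() would
        # make this more robust on slow connections
        driver.find_element_by_css_selector('.BVRRNextPage a').click()
    except NoSuchElementException:
        break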