Having trouble accessing xpath attribute with scrapy

Question

I am currently trying to scrape the following url: http://www.bedbathandbeyond.com/store/product/dyson-dc59-motorhead-cordless-vacuum/1042997979?categoryId=10562

On this page, I want to extract the number of reviews listed. That is, I want to extract the number 693.

This is my current xpath:

sel.xpath('//*[@id="BVRRRatingSummaryLinkReadID"]/a/span/span')

It only returns an empty list. Can someone suggest a correct XPath?

score 4 · Accepted Answer · edited May 23 '17 at 12:19

There are no reviews in the initial page that Scrapy downloads. The reviews are loaded and built with JavaScript after the page loads, which makes things more complicated.

Basically, your options are:

- a high-level approach: use a real browser through selenium (you can even combine Scrapy and Selenium)
- a middle-level approach: scrapy + scrapyjs
- a low-level approach: find out where the reviews are constructed and fetch them from there

Here is a working example of the low-level approach: it parses the JavaScript response with slimit, loads the embedded object with json, and then parses the HTML it contains with BeautifulSoup:

import json

from bs4 import BeautifulSoup
import requests
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

ID = 1042997979

url = 'http://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/{id}/reviews.djs?format=embeddedhtml&sort=submissionTime'.format(id=ID)

response = requests.get(url)

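# response.text is the decoded body, which works on both Python 2 and 3
parser = Parser()
tree = parser.parse(response.text)

# walk the syntax tree and look for the object literal holding the reviews HTML
data = None
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Object):
        try:
            candidate = json.loads(node.to_ecma())
        except ValueError:
            continue  # not every object literal is valid JSON
        if "BVRRSourceID" in candidate:
            data = candidate
            break

if data is None:
    raise ValueError("no object with a BVRRSourceID key in the response")

soup = BeautifulSoup(data['BVRRSourceID'], 'html.parser')
print(soup.select('span.BVRRCount span.BVRRNumber')[0].text)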

Prints 693.

To adapt the solution to Scrapy, you would need to make a request with Scrapy instead of requests, and parse the HTML with Scrapy instead of BeautifulSoup.
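
As a rough sketch of that adaptation, the fragment lookup can also be done without slimit by scanning the script text for JSON object literals with the standard library. The helper name `find_object_with_key` and the sample payload below are made up for illustration, and the approach assumes the object holding `BVRRSourceID` is valid JSON (quoted keys, double-quoted strings), which the real response may or may not satisfy:

```python
import json
import re


def find_object_with_key(js_source, key):
    """Return the first JSON object in js_source that has `key`, else None.

    Hypothetical helper: tries to decode a JSON object at every "{" in the
    script text, so it only finds objects that are valid JSON.
    """
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", js_source):
        try:
            obj, _end = decoder.raw_decode(js_source, match.start())
        except ValueError:
            continue  # this brace does not open a valid JSON object
        if isinstance(obj, dict) and key in obj:
            return obj
    return None


# Made-up stand-in for the reviews.djs body; the real payload is much larger.
SAMPLE_JS = (
    'var materials = {"BVRRSourceID": "<span class=\\"BVRRCount\\">'
    '<span class=\\"BVRRNumber\\">693</span> reviews</span>"};'
)

found = find_object_with_key(SAMPLE_JS, "BVRRSourceID")
fragment = found["BVRRSourceID"] if found else None
```

In a Scrapy callback, `js_source` would be `response.text`, and the returned fragment could be wrapped in `scrapy.Selector(text=fragment)` and queried with `.css('span.BVRRCount span.BVRRNumber::text')` in place of BeautifulSoup.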

score 0 · Answer 2 · answered Dec 11 '14 at 16:00

You cannot do that with the HTML alone. If you only download the HTML from this URL, you won't find the string 693 anywhere in it. This content must be created dynamically by some AJAX code.

fwu


So there is no way to get the value of response from the HTML? – yesyouken Dec 11 '14 at 16:31
You may need to use WebKit or something similar to render the web page first. That would be more complicated. – fwu Dec 11 '14 at 16:37
