The initial page you get with Scrapy contains no reviews. The reviews are loaded and built with heavy use of JavaScript, which makes things more complicated.
Basically, your options are:
- a high-level approach (for example, use a real browser with selenium). You can even combine Scrapy and Selenium.
- a middle-level approach: scrapy + scrapyjs
- a low-level approach (find out where the reviews are constructed and get them)
Here is a working example of the low-level approach. It parses the JavaScript code with json and slimit, extracts the HTML from it, and parses that HTML with BeautifulSoup:

import json

import requests
from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

ID = 1042997979
url = 'http://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/{id}/reviews.djs?format=embeddedhtml&sort=submissionTime'.format(id=ID)

# Download the JavaScript that builds the reviews
response = requests.get(url)

# Parse the JavaScript into a syntax tree
parser = Parser()
tree = parser.parse(response.content)

# Find the object literal that holds the reviews HTML
data = ""
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Object):
        data = json.loads(node.to_ecma())
        if "BVRRSourceID" in data:
            break

# Parse the embedded HTML and print the review count
soup = BeautifulSoup(data['BVRRSourceID'])
print(soup.select('span.BVRRCount span.BVRRNumber')[0].text)

Prints 693.
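If slimit is not available in your environment, the same search can be sketched with the standard library alone. This is a swapped-in technique, not the slimit-based method above: it scans for object literals with json.JSONDecoder.raw_decode and assumes the object holding BVRRSourceID is valid JSON (the json.loads call above relies on the same assumption). The helper name find_json_object and the sample_js string are hypothetical, made up so the example runs without a network request.

```python
import json


def find_json_object(js_source, key):
    """Return the first JSON object in js_source that contains key, or None."""
    decoder = json.JSONDecoder()
    pos = js_source.find("{")
    while pos != -1:
        try:
            # Try to decode a JSON value starting exactly at this brace
            obj, _end = decoder.raw_decode(js_source, pos)
        except ValueError:
            obj = None  # not valid JSON at this position, keep scanning
        if isinstance(obj, dict) and key in obj:
            return obj
        pos = js_source.find("{", pos + 1)
    return None


# Hypothetical stand-in for the downloaded reviews.djs content
sample_js = "var materials = " + json.dumps(
    {"BVRRSourceID": "<span class='BVRRNumber'>693</span>"}
) + ";"

data = find_json_object(sample_js, "BVRRSourceID")
print(data["BVRRSourceID"])
```

The returned HTML string can then be handed to BeautifulSoup exactly as in the example above. Retrying at every opening brace is slow on very large scripts, but it keeps the sketch short.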
To adapt the solution to Scrapy, you would need to make the request with Scrapy instead of requests, and parse the HTML with Scrapy instead of BeautifulSoup.