
I'm trying to use Scrapy to crawl some advertising information from this website. The site has div tags with class="product-card new_ outofstock installments_ ".

When I use:

items = response.xpath("//div[contains(@class, 'product-')]")

I get some nodes whose class attribute is "product-description", but none with "product-card".

When I use:

items = response.xpath("//div[contains(@class, 'product-card')]")

I get nothing at all.

Why is that?

Vaulstein

2 Answers


The data you want is being populated by JavaScript.

You would have to use a Selenium webdriver to extract it.

If you want to check beforehand whether the data is being populated by JavaScript, open a Scrapy shell and try extracting it as below.

scrapy shell 'http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT'

>>> response.xpath('//div[contains(@class, "product-card")]')

Output:

[]
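
While you are in the shell, you can also open the HTML that Scrapy actually downloaded with the built-in view() shortcut:

>>> view(response)

If the product cards are missing from the page that opens, they are being added later by JavaScript.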

Now, if the same XPath does return results when you run it in your browser's developer tools, then the data is populated by scripts and Selenium has to be used to get it.
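
Another quick way to confirm this, as a minimal sketch that assumes the requests library (not used anywhere else in this answer), is to fetch the raw HTML without a browser and check whether the class name appears in it at all:

import requests

# Fetch the page the same way Scrapy's downloader does: no JavaScript is executed.
raw_html = requests.get('http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT').text

# If this prints False, the product cards are rendered client-side.
print('product-card' in raw_html)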

Here is an example of extracting the data with Selenium:

import scrapy
from selenium import webdriver
from scrapy.http import TextResponse

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['lazada.vn']
    start_urls = ['http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # A real browser runs the JavaScript that Scrapy's downloader cannot.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the page in the browser, then wrap the rendered HTML in a
        # TextResponse so the usual Scrapy selectors work on it.
        self.driver.get(response.url)
        page = TextResponse(response.url, body=self.driver.page_source, encoding='utf-8')
        required_data = page.xpath('//div[contains(@class, "product-card")]').extract()

        self.driver.close()

        for card in required_data:
            yield {'product_card': card}
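
One thing the spider above does not handle is that driver.page_source can be read before the page has finished rendering. A minimal sketch of an explicit wait (assuming the rendered cards carry the product-card class) that could go right after self.driver.get(response.url):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one product card to appear in the DOM
# before reading driver.page_source.
WebDriverWait(self.driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.product-card'))
)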

Here are some examples of "selenium spiders":

  1. Executing Javascript Submit form functions using scrapy in python
  2. Snipplr
  3. Scrapy with selenium
  4. Extract data from dynamic webpages
Vaulstein

As pointed out in the previous answer, the content you are trying to scrape is generated dynamically with JavaScript. If performance is not a big concern, you can use Selenium to emulate a real user and interact with the site, while still letting Scrapy extract the data for you.
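
As a rough illustration of that idea, the sketch below scrolls the page so that lazily loaded cards are rendered before the HTML is handed to Scrapy's selectors. The URL comes from the question; the scrolling step is an assumption about how the site loads its content:

from selenium import webdriver
from scrapy.http import TextResponse

driver = webdriver.Firefox()
driver.get('http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT')

# Emulate a user scrolling to the bottom so lazily loaded cards get rendered.
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wrap the rendered HTML so Scrapy's selectors can be used on it.
page = TextResponse(driver.current_url, body=driver.page_source, encoding='utf-8')
product_cards = page.xpath('//div[contains(@class, "product-card")]')

driver.quit()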

If you want a full example of how to do this, have a look at this tutorial: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/

narko