
I'm trying to use Scrapy to crawl some advertising information from this website. The site has div tags with class="product-card new_ outofstock installments_ ".

When I use:

items = response.xpath("//div[contains(@class, 'product-')]")

I get some nodes whose class attribute is "product-description", but none with "product-card".

When I use:

items = response.xpath("//div[contains(@class, 'product-card')]")

I get nothing at all.

Why is that?

Vaulstein

2 Answers


The data you want is being populated by JavaScript.

You would have to use a Selenium webdriver to extract it.

If you want to check beforehand whether the data is being populated by JavaScript, open a Scrapy shell and try extracting it as below.

scrapy shell 'http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT'

>>> response.xpath('//div[contains(@class, "product-card")]')

Output:

[]
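
While you are in the shell, you can also open the HTML that Scrapy actually downloaded with the built-in view() shortcut:

>>> view(response)

If the product cards are missing from the page that opens, they are being added later by JavaScript.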

Now, if the same XPath does return results when you run it in your browser's developer tools, then the data is populated by scripts and Selenium has to be used to get it.
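
Another quick way to confirm this, as a minimal sketch that assumes the requests library (not used anywhere else in this answer), is to fetch the raw HTML without a browser and check whether the class name appears in it at all:

import requests

# Fetch the page the same way Scrapy's downloader does: no JavaScript is executed.
raw_html = requests.get('http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT').text

# If this prints False, the product cards are rendered client-side.
print('product-card' in raw_html)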

Here is an example of extracting the data with Selenium:

import scrapy
from selenium import webdriver
from scrapy.http import TextResponse

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['lazada.vn']
    start_urls = ['http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # A real browser runs the JavaScript that Scrapy's downloader cannot.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the page in the browser, then wrap the rendered HTML in a
        # TextResponse so the usual Scrapy selectors work on it.
        self.driver.get(response.url)
        page = TextResponse(response.url, body=self.driver.page_source, encoding='utf-8')
        required_data = page.xpath('//div[contains(@class, "product-card")]').extract()

        self.driver.close()

        for card in required_data:
            yield {'product_card': card}
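
One thing the spider above does not handle is that driver.page_source can be read before the page has finished rendering. A minimal sketch of an explicit wait (assuming the rendered cards carry the product-card class) that could go right after self.driver.get(response.url):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one product card to appear in the DOM
# before reading driver.page_source.
WebDriverWait(self.driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.product-card'))
)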

Here are some examples of "selenium spiders":

  1. Executing Javascript Submit form functions using scrapy in python
  2. Snipplr
  3. Scrapy with selenium
  4. Extract data from dynamic webpages
Vaulstein

As pointed out in the previous answer, the content you are trying to scrape is generated dynamically with JavaScript. If performance is not a big concern, you can use Selenium to emulate a real user and interact with the site, while still letting Scrapy extract the data for you.
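
As a rough illustration of that idea, the sketch below scrolls the page so that lazily loaded cards are rendered before the HTML is handed to Scrapy's selectors. The URL comes from the question; the scrolling step is an assumption about how the site loads its content:

from selenium import webdriver
from scrapy.http import TextResponse

driver = webdriver.Firefox()
driver.get('http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT')

# Emulate a user scrolling to the bottom so lazily loaded cards get rendered.
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wrap the rendered HTML so Scrapy's selectors can be used on it.
page = TextResponse(driver.current_url, body=driver.page_source, encoding='utf-8')
product_cards = page.xpath('//div[contains(@class, "product-card")]')

driver.quit()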

If you want a full example of how to do this, have a look at this tutorial: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/

narko