
I am having difficulty getting my scraper to work. I took the initial example code from [selenium with scrapy for dynamic page] by @alecxe and extended it to collect some results. The scraper does launch (I can watch it simulate clicking the next button), but it shuts down a second later and doesn't print or collect anything in the items.

Here is the code:

from scrapy.spider import BaseSpider 
from selenium import webdriver

class product_spiderItem(scrapy.Item):
    title = scrapy.Field()
    price=scrapy.Field()
    pass

class ProductSpider(BaseSpider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')

            try:
                next.click()

                # get the data and write it to scrapy items
                response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
                print response.url
                for prod in response.xpath('//ul[@id="GalleryViewInner"]/li/div/div'):
                    item = product_spiderItem()
                    item['title'] = prod.xpath('.//div[@class="gvtitle"]/h3/a/text()').extract()[0]
                    item['price'] = prid.xpath('.//div[@class="prices"]/span[@class="bold"]/text()').extract()[0]
                    print item['price']
                    yield item

            except:
                break

        self.driver.close()

I use `scrapy crawl product_spider -o products.json` to store the results. What am I missing?

Stéphanie C

1 Answer


While trying to understand what's wrong with your code, I did some editing and came up with the following (tested) code, which should be closer to your goal:

import scrapy
from selenium import webdriver

class product_spiderItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:

            sel = scrapy.Selector(text=self.driver.page_source)

            for prod in sel.xpath('//ul[@id="GalleryViewInner"]/li/div/div'):
                item = product_spiderItem()
                item['title'] = prod.xpath('.//div[@class="gvtitle"]/h3/a/text()').extract()
                item['price'] = prod.xpath('.//div[@class="prices"]//span[@class=" bold"]/text()').extract()
                yield item

            next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')

            try:
                next.click()

            except:
                break

    def closed(self, reason):
        self.driver.close()

Please check whether this code works better for you.

Frank Martin
  • It's working perfectly! Thanks a lot!! I guess the main reason is selecting the page content with sel = scrapy.Selector(text=self.driver.page_source) instead of TextResponse. But TextResponse seemed to work in other code I've seen given as examples. – Stéphanie C Jul 02 '15 at 15:59
    You're welcome :-) Please note that I removed the [0] after extract() to prevent errors when no element is found. – Frank Martin Jul 02 '15 at 16:05