
I am trying to build a crawler using Scrapy and Selenium WebDriver. I get a set of URLs in parse() and pass them to a callback parse_url(), which in turn gets a different set of URLs and passes those to parse_data().

The first callback to parse_url works, but the second one to parse_data gives an AssertionError.

That is, if I run it without parse_data it prints a list of URLs, but if I include it I get an AssertionError.

I have something like this:

class mySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/url",
    ]

    def parse(self, response):
        driver = webdriver.Firefox()
        driver.get(response.url)
        urls = get_urls(driver.page_source)  # get_urls returns a list
        yield scrapy.Request(urls, callback=self.parse_url(urls, driver))

    def parse_url(self, urls, driver):
        url_list = []
        for i in urls:
            driver.get(i)
            url_list.append(get_urls(driver.page_source))  # gets some more urls
        yield scrapy.Request(urls, callback=self.parse_data(url_list, driver))

    def parse_data(self, url_list, driver):
        data = get_data(driver.page_source)

This is the traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/spidermw.py", line 48, in process_spider_input
    return scrape_func(response, request, spider)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/scraper.py", line 145, in call_spider
    dfd.addCallbacks(request.callback or spider.parse, request.errback)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 299, in addCallbacks
    assert callable(callback)
AssertionError
  • Just as a side note: it seems to me you are doing quite a lot with Selenium that could be done with Scrapy too: extracting the URLs from the site (currently you load the sites twice: once with Scrapy and then with Selenium). – GHajba Jul 08 '15 at 09:32
  • The web page has dynamic content. In the actual code I have to scroll down and let elements load. Hence Selenium – bwayne Jul 08 '15 at 09:39

1 Answer


There are two problems:

  1. You're not passing your function to the request. You are passing the return value of your function to the request.

  2. The callback for a Request must be a callable with the signature (self, response); that is what the assert callable(callback) in the traceback is checking.

A solution to the dynamic content is here: https://stackoverflow.com/a/24373576/2368836

It will eliminate the need to pass the driver into the function.
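Roughly, the approach in that answer is a downloader middleware that renders the page with Selenium and hands Scrapy the rendered HTML, so your callbacks only ever see a normal response. A minimal sketch, assuming Firefox and a project package called myproject (both just examples, not taken from your code):

    # middlewares.py -- sketch only; class and module names are examples
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class SeleniumMiddleware(object):
        def __init__(self):
            self.driver = webdriver.Firefox()

        def process_request(self, request, spider):
            # Let Selenium load the page (and run its JavaScript), then
            # return the rendered HTML so Scrapy doesn't fetch the URL again.
            self.driver.get(request.url)
            return HtmlResponse(self.driver.current_url,
                                body=self.driver.page_source,
                                encoding='utf-8',
                                request=request)

Enable it in settings.py with something like:

    DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543}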

So when yielding your requests it should be like so (one Request per URL, since a Request takes a single URL, not a list):

scrapy.Request(url, callback=self.parse_url)
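
With something like that middleware in place (or any other way of getting the rendered HTML into the response), the spider can use ordinary (self, response) callbacks. A rough sketch reusing your get_urls/get_data helpers from the question:

    def parse(self, response):
        for url in get_urls(response.body):
            yield scrapy.Request(url, callback=self.parse_url)

    def parse_url(self, response):
        for url in get_urls(response.body):
            yield scrapy.Request(url, callback=self.parse_data)

    def parse_data(self, response):
        data = get_data(response.body)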

If you really want to include the driver with that function, read about closures.

Edit: Here is a closure solution but I think you should use the link I shared because of the reasons GHajba pointed out.

    def parse_data(self, url_list, driver):
        def encapsulated(response):
            data = get_data(driver.page_source)
            .....
            .....
            yield item
        return encapsulated

Then your request looks like:

yield scrapy.Request(url, callback=self.parse_data(url_list, driver))
  • Thank you! [Spider middlewares](http://doc.scrapy.org/en/latest/topics/spider-middleware.html#spider-middleware) are what I was looking for but I didn't know it. – bwayne Jul 29 '15 at 09:52