
I have a spider where I have to use Selenium to scrape dynamic data on the page. Here's what it looks like:

import scrapy
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.org']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(5)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        if self.driver:
            self.driver.quit()
            self.driver = None

The problem is that when I cancel the job in Scrapyd, it doesn't stop until I manually close the browser window. Obviously I won't be able to do that once the spider is deployed to a real server.

Here's what I see in Scrapyd logs each time I hit "Cancel":

2015-08-12 13:48:13+0300 [HTTPChannel,208,127.0.0.1] Unhandled Error
    Traceback (most recent call last):
      File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/web/http.py", line 1731, in allContentReceived
        req.requestReceived(command, path, version)
      File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/web/http.py", line 827, in requestReceived
        self.process()
      File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/web/server.py", line 189, in process
        self.render(resrc)
      File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/web/server.py", line 238, in render
        body = resrc.render(self)
    --- <exception caught here> ---
      File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/scrapyd/webservice.py", line 18, in render
        return JsonResource.render(self, txrequest)
      File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/scrapy/utils/txweb.py", line 10, in render
        r = resource.Resource.render(self, txrequest)
      File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/web/resource.py", line 250, in render
        return m(request)
      File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/scrapyd/webservice.py", line 55, in render_POST
        s.transport.signalProcess(signal)
      File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/internet/process.py", line 339, in signalProcess
        raise ProcessExitedAlready()
    twisted.internet.error.ProcessExitedAlready: 

But the job is still in the job list and it's marked as "Running". So how can I shut down the driver?

Dmitrii Mikhailov

2 Answers


Import SignalManager:

from scrapy.signalmanager import SignalManager

Then replace:

dispatcher.connect(self.spider_closed, signals.spider_closed)

With:

SignalManager(dispatcher.Any).connect(self.spider_closed, signal=signals.spider_closed)
Rejected

Have you tried implementing from_crawler on the spider? I've only done this on pipelines and extensions, but it should work the same for spiders.

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    o = cls(*args, **kwargs)
    crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
    return o

http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.from_crawler
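To show why this wiring guarantees the browser quits, here's a self-contained sketch of the from_crawler pattern that runs without Scrapy or Selenium installed. FakeSignals, FakeDriver, and FakeCrawler are illustrative stand-ins for crawler.signals, webdriver.Firefox(), and the crawler object, not real Scrapy/Selenium APIs:

```python
class FakeSignals:
    """Stand-in for crawler.signals: supports connect() and send()."""
    def __init__(self):
        self._receivers = {}

    def connect(self, receiver, signal):
        self._receivers.setdefault(signal, []).append(receiver)

    def send(self, signal, **kwargs):
        # Scrapy fires the signal like this when the spider closes
        for receiver in self._receivers.get(signal, []):
            receiver(**kwargs)


class FakeDriver:
    """Stand-in for webdriver.Firefox(); records whether quit() ran."""
    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


class FakeCrawler:
    def __init__(self):
        self.signals = FakeSignals()


SPIDER_CLOSED = 'spider_closed'  # stands in for scrapy.signals.spider_closed


class MySpider(object):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        # In the real spider this would be webdriver.Firefox()
        self.driver = FakeDriver()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        o = cls(*args, **kwargs)
        crawler.signals.connect(o.spider_closed, signal=SPIDER_CLOSED)
        return o

    def spider_closed(self, spider):
        if self.driver:
            self.driver.quit()
            self.driver = None


crawler = FakeCrawler()
spider = MySpider.from_crawler(crawler)
driver = spider.driver
crawler.signals.send(SPIDER_CLOSED, spider=spider)
print(driver.quit_called)  # the driver was quit when the signal fired
```

When the real spider_closed signal fires on a clean shutdown, crawler.signals invokes the handler the same way, so the browser quits without anyone having to close the window by hand.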

rocktheartsm4l