55

Is there a way to trigger a method in a Spider class just before it terminates?

I can terminate the spider myself, like this:

from scrapy.exceptions import CloseSpider

class MySpider(CrawlSpider):
    # Config stuff goes here...

    def quit(self):
        # Do some stuff...
        raise CloseSpider('MySpider is quitting now.')

    def my_parser(self, response):
        if termination_condition:
            self.quit()

        # Parsing stuff goes here...

But I can't find any information on how to determine when the spider is about to quit naturally.

Abe

6 Answers

84

It looks like you can register a signal listener through dispatcher.

I would try something like:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # second param is the instance of the spider about to be closed
        # do your cleanup here
        pass

In newer versions of Scrapy, `scrapy.xlib.pydispatch` is deprecated; instead you can use `from pydispatch import dispatcher`.
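
For reference, a minimal sketch of the same approach with the standalone import (assuming the PyDispatcher package is installed; the spider name is a placeholder):

from scrapy import signals
from scrapy.spiders import CrawlSpider
from pydispatch import dispatcher  # standalone PyDispatcher package

class MySpider(CrawlSpider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # register a handler for the spider_closed signal
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # called just before the spider is closed
        spider.logger.info('Spider closed: %s', spider.name)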

dm03514
  • Works perfectly. But I'd suggest naming the method MySpider.quit() or something similar, to avoid confusion with the signal name. Thanks! – Abe Sep 12 '12 at 18:52
  • Excellent solution. And yes, the example should work exactly the same with a `CrawlSpider`. – Daniel Werner Sep 13 '12 at 19:23
  • This solution also works fine on Scrapy **0.20.0**, contrary to what @Chris said below. – not2qubit Jan 04 '14 at 20:06
  • This solution also works fine on Scrapy 0.24.4, contrary to what @Chris said below. – shellbye Dec 25 '14 at 02:44
  • I'm confused by why the second parameter of spider_closed is necessary. Isn't the spider to be closed self? – chishaku Mar 09 '15 at 09:26
  • Doesn't work with v1.1 because xlib.pydispatch was deprecated. Instead, they recommend using PyDispatcher, though I couldn't make it work yet... – Desprit Sep 16 '16 at 12:14
  • Fabulous! This is exactly what I was looking for, and it works perfectly fine! Great input, mate, and thanks :3 – wj127 Mar 15 '17 at 14:51
  • This *still* works in `Python 3.6.4`, with `Scrapy 1.5.1` and using `PyDispatcher 2.0.5`, even if you also have a `def spider_closed(..)` in some pipeline *class* in your `pipelines.py`. However, it is also **deprecated** as shown [here](https://github.com/scrapy/scrapy/issues/1762), so use the *new* method as explained by @Levon. – not2qubit Oct 02 '18 at 13:06
  • In newer versions of Scrapy, `scrapy.xlib.pydispatch` is deprecated; instead you can use `from pydispatch import dispatcher`. – Mrugesh Kadia Feb 22 '20 at 13:24
65

Just to update: you can simply define a `closed` method in your spider class, and Scrapy will call it when the spider finishes:

class MySpider(CrawlSpider):
    def closed(self, reason):
        # 'reason' describes why the spider closed, e.g. 'finished'
        do_something()
THIS USER NEEDS HELP
19

For Scrapy version 1.0.0+ (it may also work for older versions).

from scrapy import signals

class MySpider(CrawlSpider):
    name = 'myspider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        print('Opening {} spider'.format(spider.name))

    def spider_closed(self, spider):
        print('Closing {} spider'.format(spider.name))

One good use is adding a tqdm progress bar to a Scrapy spider.

# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tqdm import tqdm

from myproject.items import MyItem  # adjust to your project's items module


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['somedomain.comm']
    start_urls = ['http://www.somedomain.comm/ccid.php']

    rules = (
        Rule(LinkExtractor(allow=r'^http://www.somedomain.comm/ccds.php\?id=.*'),
             callback='parse_item',
             ),
        Rule(LinkExtractor(allow=r'^http://www.somedomain.comm/ccid.php$',
                           restrict_xpaths='//table/tr[contains(., "SMTH")]'), follow=True),
    )

    def parse_item(self, response):
        self.pbar.update()  # update progress bar by 1
        item = MyItem()
        # parse response
        return item

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        self.pbar = tqdm()  # initialize progress bar
        self.pbar.clear()
        self.pbar.write('Opening {} spider'.format(spider.name))

    def spider_closed(self, spider):
        self.pbar.clear()
        self.pbar.write('Closing {} spider'.format(spider.name))
        self.pbar.close()  # close progress bar
Levon
  • This is the **new** method! Although it looks less transparent, its advantage is removing the extra clutter of `def __init__(self): ...` and the PyDispatcher import `from scrapy.xlib.pydispatch import dispatcher`. – not2qubit Oct 02 '18 at 13:12
12

For the latest version (v1.7), just define a `closed(reason)` method in your spider class.

closed(reason):

Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.

Scrapy docs: scrapy.spiders.Spider.closed
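
For example, a minimal sketch (the spider name and start URL are placeholders):

from scrapy.spiders import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        yield {'url': response.url}

    def closed(self, reason):
        # 'reason' is e.g. 'finished', 'cancelled' or 'shutdown'
        self.logger.info('Spider closed: %s', reason)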

thomasXu
7

For me the accepted answer did not work / is outdated, at least for Scrapy 0.19. I got it to work with the following, though:

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        SignalManager(dispatcher.Any).connect(
            self.closed_handler, signal=signals.spider_closed)

    def closed_handler(self, spider):
        # do stuff here
        pass
Chris
1

If you have many spiders and want to do something before each of them closes, it may be convenient to add a custom stats collector to your project.

In settings.py:

STATS_CLASS = 'scraper.stats.MyStatsCollector'

And the stats collector:

from scrapy.statscollectors import StatsCollector

class MyStatsCollector(StatsCollector):
    def _persist_stats(self, stats, spider):
        # do something here; called once per spider when it closes
        super(MyStatsCollector, self)._persist_stats(stats, spider)
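
As an illustration, a minimal sketch that persists each spider's stats to a JSON file (the output filename is an arbitrary choice, not part of the Scrapy API):

import json

from scrapy.statscollectors import StatsCollector

class MyStatsCollector(StatsCollector):
    def _persist_stats(self, stats, spider):
        # runs once per spider, just before it closes;
        # dump the collected stats to a per-spider JSON file
        with open('stats-%s.json' % spider.name, 'w') as f:
            json.dump(stats, f, default=str)  # default=str handles datetime values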
slavugan