55

Is there a way to trigger a method in a Spider class just before it terminates?

I can terminate the spider myself, like this:

from scrapy.exceptions import CloseSpider

class MySpider(CrawlSpider):
    # Config stuff goes here...

    def quit(self):
        # Do some stuff...
        raise CloseSpider('MySpider is quitting now.')

    def my_parser(self, response):
        if termination_condition:
            self.quit()

        # Parsing stuff goes here...

But I can't find any information on how to determine when the spider is about to quit naturally.

Abe

6 Answers

84

It looks like you can register a signal listener through dispatcher.

I would try something like:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # second param is the instance of the spider about to be closed
        # do your cleanup here
        pass

In newer versions of Scrapy, `scrapy.xlib.pydispatch` is deprecated; instead you can use `from pydispatch import dispatcher`.
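
For reference, a minimal sketch of the same approach with the standalone import (assuming the PyDispatcher package is installed; the spider name is a placeholder):

from scrapy import signals
from scrapy.spiders import CrawlSpider
from pydispatch import dispatcher  # standalone PyDispatcher package

class MySpider(CrawlSpider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # register a handler for the spider_closed signal
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # called just before the spider is closed
        spider.logger.info('Spider closed: %s', spider.name)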

dm03514
  • Works perfectly. But I'd suggest naming the method MySpider.quit() or something similar, to avoid confusion with the signal name. Thanks! – Abe Sep 12 '12 at 18:52
  • Excellent solution. And yes, the example should work exactly the same with a `CrawlSpider`. – Daniel Werner Sep 13 '12 at 19:23
  • This solution also works fine on Scrapy **0.20.0**, contrary to what @Chris said below. – not2qubit Jan 04 '14 at 20:06
  • This solution also works fine on Scrapy 0.24.4, contrary to what @Chris said below. – shellbye Dec 25 '14 at 02:44
  • I'm confused by why the second parameter of spider_closed is necessary. Isn't the spider to be closed self? – chishaku Mar 09 '15 at 09:26
  • Doesn't work with v1.1 because xlib.pydispatch was deprecated. Instead, they recommend using PyDispatcher, though I couldn't make it work yet... – Desprit Sep 16 '16 at 12:14
  • Fabulous! This is exactly what I was looking for, and it works perfectly fine! Great input, mate, and thanks :3 – wj127 Mar 15 '17 at 14:51
  • This *still* works in `Python 3.6.4`, with `Scrapy 1.5.1` and using `PyDispatcher 2.0.5`, even if you also have a `def spider_closed(..)` in some pipeline *class* in your `pipelines.py`. However, it is also **deprecated** as shown [here](https://github.com/scrapy/scrapy/issues/1762), so use the *new* method as explained by @Levon. – not2qubit Oct 02 '18 at 13:06
  • In newer versions of Scrapy, `scrapy.xlib.pydispatch` is deprecated; instead you can use `from pydispatch import dispatcher`. – Mrugesh Kadia Feb 22 '20 at 13:24
65

Just to update: you can simply define a `closed` method in your spider class, and Scrapy will call it when the spider finishes:

class MySpider(CrawlSpider):
    def closed(self, reason):
        # 'reason' describes why the spider closed, e.g. 'finished'
        do_something()
THIS USER NEEDS HELP
19

For Scrapy version 1.0.0+ (it may also work for older versions).

from scrapy import signals

class MySpider(CrawlSpider):
    name = 'myspider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        print('Opening {} spider'.format(spider.name))

    def spider_closed(self, spider):
        print('Closing {} spider'.format(spider.name))

One good use is adding a tqdm progress bar to a Scrapy spider.

# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tqdm import tqdm

from myproject.items import MyItem  # adjust to your project's items module


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['somedomain.comm']
    start_urls = ['http://www.somedomain.comm/ccid.php']

    rules = (
        Rule(LinkExtractor(allow=r'^http://www.somedomain.comm/ccds.php\?id=.*'),
             callback='parse_item',
             ),
        Rule(LinkExtractor(allow=r'^http://www.somedomain.comm/ccid.php$',
                           restrict_xpaths='//table/tr[contains(., "SMTH")]'), follow=True),
    )

    def parse_item(self, response):
        self.pbar.update()  # update progress bar by 1
        item = MyItem()
        # parse response
        return item

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        self.pbar = tqdm()  # initialize progress bar
        self.pbar.clear()
        self.pbar.write('Opening {} spider'.format(spider.name))

    def spider_closed(self, spider):
        self.pbar.clear()
        self.pbar.write('Closing {} spider'.format(spider.name))
        self.pbar.close()  # close progress bar
Levon
  • This is the **new** method! Although it looks less transparent, its advantage is removing the extra clutter of `def __init__(self): ...` and the PyDispatcher import `from scrapy.xlib.pydispatch import dispatcher`. – not2qubit Oct 02 '18 at 13:12
12

For the latest version (v1.7), just define a `closed(reason)` method in your spider class.

closed(reason):

Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.

Scrapy docs: scrapy.spiders.Spider.closed
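
For example, a minimal sketch (the spider name and start URL are placeholders):

from scrapy.spiders import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        yield {'url': response.url}

    def closed(self, reason):
        # 'reason' is e.g. 'finished', 'cancelled' or 'shutdown'
        self.logger.info('Spider closed: %s', reason)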

thomasXu
7

For me the accepted answer did not work / is outdated, at least for Scrapy 0.19. I got it to work with the following, though:

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        SignalManager(dispatcher.Any).connect(
            self.closed_handler, signal=signals.spider_closed)

    def closed_handler(self, spider):
        # do stuff here
        pass
Chris
1

If you have many spiders and want to do something before each of them closes, it may be convenient to add a custom stats collector to your project.

In settings.py:

STATS_CLASS = 'scraper.stats.MyStatsCollector'

And the stats collector:

from scrapy.statscollectors import StatsCollector

class MyStatsCollector(StatsCollector):
    def _persist_stats(self, stats, spider):
        # do something here; called once per spider when it closes
        super(MyStatsCollector, self)._persist_stats(stats, spider)
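
As an illustration, a minimal sketch that persists each spider's stats to a JSON file (the output filename is an arbitrary choice, not part of the Scrapy API):

import json

from scrapy.statscollectors import StatsCollector

class MyStatsCollector(StatsCollector):
    def _persist_stats(self, stats, spider):
        # runs once per spider, just before it closes;
        # dump the collected stats to a per-spider JSON file
        with open('stats-%s.json' % spider.name, 'w') as f:
            json.dump(stats, f, default=str)  # default=str handles datetime values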
slavugan