
I need to raise CloseSpider from a Scrapy pipeline. Either that, or return some parameter from the pipeline back to the spider so it can do the raise.

For example, if the date already exists, raise CloseSpider:

raise CloseSpider('Already been scraped:' + response.url)

Is there a way to do this?

MoreScratch
  • Quite related: https://stackoverflow.com/a/9699317/771848. – alecxe May 20 '18 at 04:31
  • Can't call CloseSpider from pipelines. Use a hack: set a variable on the spider instance you get in the pipeline's process_item function. –  Sep 26 '19 at 09:01

2 Answers


According to the Scrapy docs, the CloseSpider exception can only be raised from a callback function in a spider (by default, the parse method). Raising it in a pipeline will crash the spider. To achieve a similar result from a pipeline, you can initiate a shutdown signal that will close Scrapy gracefully.

# Works only on old Scrapy (pre-1.0); the scrapy.project singleton was removed later.
from scrapy.project import crawler
crawler._signal_shutdown(9, 0)

Do remember that Scrapy may still process requests that are already in flight, or even just scheduled, after the shutdown signal is initiated.
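
On Scrapy 1.0 and later there is no importable crawler singleton, but the same "shutdown signal" effect can be approximated from a pipeline by sending the process a SIGINT, which CrawlerProcess traps like a single Ctrl-C (graceful stop; a second signal forces an unclean one). A minimal sketch, assuming a POSIX platform, a run via scrapy crawl, a hypothetical DedupPipeline name, and a made-up date field on the item:

import os
import signal

class DedupPipeline:
    # Hypothetical pipeline: request a graceful shutdown when a date repeats.

    def open_spider(self, spider):
        self.seen_dates = set()

    def process_item(self, item, spider):
        date = item.get('date')
        if date in self.seen_dates:
            # CrawlerProcess handles SIGINT like one Ctrl-C: stop scheduling
            # new requests, let in-flight ones finish, then shut down.
            os.kill(os.getpid(), signal.SIGINT)
        self.seen_dates.add(date)
        return item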

To do it from the spider instead, set a flag on the spider from the pipeline like this:

def process_item(self, item, spider):
    if some_condition_is_met:
        spider.close_manually = True
    return item

Then, in your spider's callback, you can raise the CloseSpider exception:

from scrapy.exceptions import CloseSpider

def parse(self, response):
    if self.close_manually:
        raise CloseSpider('Already been scraped.')
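
Putting the two pieces together, a minimal sketch (the spider name and start URL are made up); defaulting close_manually to False on the class keeps the first callback from hitting an AttributeError before the pipeline has flipped it:

import scrapy
from scrapy.exceptions import CloseSpider

class DatesSpider(scrapy.Spider):
    name = 'dates'                        # hypothetical spider name
    start_urls = ['https://example.com']  # hypothetical start URL
    close_manually = False                # the pipeline flips this to True

    def parse(self, response):
        if self.close_manually:
            raise CloseSpider('Already been scraped.')
        # ...normal parsing and yielding continues otherwise...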

I prefer the following solution.

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # Pass the spider (not the pipeline) to the engine.
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item

Source: Force spider to stop in scrapy
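
Since the engine keeps draining in-flight responses while it closes, a hedged refinement (is_duplicate is a hypothetical check, not part of the original answer) is to also raise DropItem so the triggering item goes no further through the pipeline, and to guard against asking the engine to close more than once:

from scrapy.exceptions import DropItem

class MongoDBPipeline(object):

    def open_spider(self, spider):
        self.closing = False

    def process_item(self, item, spider):
        if self.is_duplicate(item):  # hypothetical duplicate check
            if not self.closing:
                self.closing = True
                spider.crawler.engine.close_spider(spider, reason='duplicate')
            raise DropItem('Already been scraped')
        return item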

Macbric
  • Somehow this is not working. It seems like the preferred solution to stopping a spider is to stop yielding in the parse functions – vladimir.gorea Apr 24 '20 at 05:16