
We have a system written with Scrapy to crawl a few websites. There are several spiders, and a few cascaded pipelines through which the items from all the crawlers pass. One of the pipeline components queries the Google servers to geocode addresses. Google imposes a limit of 2500 requests per day per IP address, and threatens to ban an IP address if it keeps querying Google even after Google has responded with the warning message 'OVER_QUERY_LIMIT'.
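
(For illustration, the geocoding check in that pipeline component boils down to something like the following sketch; the helper name is made up and the real pipeline differs in detail:)

import json
import urllib
import urllib2

GEOCODE_URL = 'https://maps.googleapis.com/maps/api/geocode/json'

def geocode_status(address):
    # Query the Google Geocoding API and return its 'status' field,
    # which is 'OK' on success and 'OVER_QUERY_LIMIT' once the daily quota is hit.
    params = urllib.urlencode({'address': address, 'sensor': 'false'})
    response = json.load(urllib2.urlopen(GEOCODE_URL + '?' + params))
    return response.get('status')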

Hence I want to know of a mechanism that I can invoke from within the pipeline which will completely and immediately stop all further crawling/processing by all spiders, and shut down the main engine as well.

I have checked other similar questions and their answers have not worked:

from scrapy.project import crawler
crawler._signal_shutdown(9, 0)  # run this if the connection fails

This does not work, as it takes time for the spider to stop execution, so many more requests are made to Google in the meantime (which could get my IP address banned).

import sys
sys.exit("SHUT DOWN EVERYTHING!")

This one doesn't work at all: items keep getting generated and passed to the pipeline, although the log spews sys.exit() -> exceptions.SystemExit raised (to no effect).

crawler.engine.close_spider(self, 'log message')

This one has the same problem as the first case mentioned above.

I also tried:

scrapy.project.crawler.engine.stop()

To no avail.

EDIT: If, in the pipeline, I do:

from scrapy.contrib.closespider import CloseSpider

what should I pass as the 'crawler' argument to CloseSpider's __init__() from the scope of my pipeline?

aniketd

1 Answer


You can raise a CloseSpider exception to close down a spider. However, I don't think this will work from a pipeline.

EDIT: avaleske notes in the comments to this answer that he was able to raise a CloseSpider exception from a pipeline. It would be wisest to use this approach.
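
A minimal sketch of that approach, assuming the quota condition is detected inside process_item; the pipeline class name and the flag are illustrative placeholders:

from scrapy.exceptions import CloseSpider

class GeocodingPipeline(object):
    quota_exceeded = False  # set to True once Google answers 'OVER_QUERY_LIMIT'

    def process_item(self, item, spider):
        if self.quota_exceeded:
            raise CloseSpider('Google geocoding quota exceeded')
        return item

Whether this stops the whole crawl immediately may depend on the Scrapy version; see the comments below.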

A similar situation has been described on the Scrapy Users group, in this thread.

I quote:

To close a spider from any part of your code, you should use the engine.close_spider method. See this extension for a usage example: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/closespider.py#L61

You could write your own extension, using closespider.py as an example, which shuts the spider down once a certain condition has been met.
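
A rough sketch of such an extension, modelled loosely on closespider.py and assuming a Scrapy version that exposes crawler.signals and the item_scraped signal (the class name, flag and reason string are placeholders):

from scrapy import signals

class CloseOnGeocodeLimit(object):
    # Illustrative extension: closes the spider once a flag set elsewhere
    # (e.g. by a pipeline) indicates the geocoding quota was exceeded.

    def __init__(self, crawler):
        self.crawler = crawler
        crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def item_scraped(self, item, spider):
        if getattr(spider, 'close_down', False):
            self.crawler.engine.close_spider(spider, 'geocoding quota exceeded')

You would then enable it through the EXTENSIONS setting.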

Another "hack" would be to set a flag on the spider in the pipeline. For example:

pipeline:

def process_item(self, item, spider):
    if some_flag:  # e.g. Google has answered with 'OVER_QUERY_LIMIT'
        spider.close_down = True

spider:

from scrapy.exceptions import CloseSpider

def parse(self, response):
    # close_down should be initialised to False on the spider
    if self.close_down:
        raise CloseSpider(reason='API usage exceeded')
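
Regarding the 'crawler' argument asked about in the question's EDIT: in newer Scrapy versions than the one discussed here, an item pipeline can obtain the crawler itself through a from_crawler classmethod and call engine.close_spider directly. A hedged sketch with placeholder names (note that requests already scheduled may still complete):

class GeocodingPipeline(object):
    quota_exceeded = False  # set to True once Google answers 'OVER_QUERY_LIMIT'

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        pipeline.crawler = crawler  # keep a handle on the running crawler
        return pipeline

    def process_item(self, item, spider):
        if self.quota_exceeded:
            # ask the engine to close this spider as soon as it can
            self.crawler.engine.close_spider(spider, 'geocoding quota exceeded')
        return item
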
Sjaak Trekhaak
  • Thanks for the post. I figure this will close the spider like the first example shown above, but it takes time, and a few items from each scheduled spider will still go through the pipeline. Which means that hundreds of queries will still be made to Google after the warning has been received... How do I KILL the whole thing??? If there can't be a way at all, I shall use the "hack"! Thanks a lot!!! – aniketd Mar 14 '12 at 10:02
  • Also the CloseSpider class takes a 'crawler' argument. In my pipeline and its scope what object should be passed? – aniketd Mar 14 '12 at 10:18
  • I'm not really sure what you are referring to; but this doc about extensions might help: http://doc.scrapy.org/en/latest/topics/extensions.html and the doc about pipelines: http://doc.scrapy.org/en/latest/topics/item-pipeline.html . I would pass the spider on to the pipeline, set the flag there, and raise a CloseSpider exception in the spider itself. – Sjaak Trekhaak Mar 14 '12 at 10:27
  • Scrapy is async, so by the time you process the response, a bunch of extra requests have already been made. Even if you stop the crawler immediately, it's still too late, so don't sweat it. A couple of hundred extra requests won't get you a permanent ban from Google. – Kien Truong Mar 14 '12 at 11:46
  • Thanks for the "hack" suggestion @SjaakTrekhaak . – aniketd Mar 14 '12 at 11:51
  • I was just able to raise a CloseSpider exception from within a pipeline. – avaleske Dec 28 '12 at 22:14
  • @Dikei: Being async does not mean that doing 100 more scrapes after a request to stop is valid. The proper solution would be to push the items in the download queue back onto a to-be-scraped list... – Vajk Hermecz Feb 25 '13 at 12:59
  • FYI - I just raised a CloseSpider("let's close everything!") from the pipeline, which did not stop my CrawlSpider's crawl (which uses a Rule() and a LinkExtractor())... using the flag to raise the exception from inside the spider did the trick though :) – UriCS Aug 30 '16 at 02:40