
I wrote a crawler with Scrapy.

There is a function in the pipeline where I write my data to a database. I use the logging module to log runtime logs.

I found that when my string contains Chinese characters, logging.error() throws an exception. But the crawler keeps running!

I know this is a minor error, but if a critical exception occurs, I will miss it if the crawler keeps running.

My question is: is there a setting that forces Scrapy to stop when there is an exception?

scott huang

3 Answers


You can use CLOSESPIDER_ERRORCOUNT

An integer which specifies the maximum number of errors to receive before closing the spider. If the spider generates more than that number of errors, it will be closed with the reason closespider_errorcount. If zero (or not set), spiders won't be closed by the number of errors.

By default it is set to 0 (CLOSESPIDER_ERRORCOUNT = 0); you can change it to 1 if you want to exit on the first error.
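For example, in your project's settings.py (a minimal sketch; CLOSESPIDER_ERRORCOUNT is the documented setting, and the built-in CloseSpider extension that reads it is enabled by default):

    # settings.py
    # Close the spider as soon as the first error is logged,
    # with close reason 'closespider_errorcount'.
    CLOSESPIDER_ERRORCOUNT = 1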

UPDATE

As mentioned in the answers to this question, you can also use:

crawler.engine.close_spider(self, 'log message')

For more information, read:

Close spider extension
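As a rough sketch of what that call looks like from inside a spider callback (the spider, URL and close reason below are made up for illustration; self.crawler is set on spiders created through the normal Scrapy machinery):

    import scrapy

    class MySpider(scrapy.Spider):
        # Hypothetical spider, only to show where the call goes.
        name = 'my_spider'
        start_urls = ['http://example.com']

        def parse(self, response):
            title = response.css('title::text').get()
            if title is None:
                # Stop the whole crawl; the string becomes the close reason in the stats.
                self.crawler.engine.close_spider(self, 'no title found')
                return
            yield {'title': title}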

parik
  • I missed that one! Good option. – paul trmbrth Jun 08 '17 at 09:39
  • Hi parik, I think your answer is what I want. I added the following code to my spider, but it does not work; can you help me with this? EXTENSIONS = { # 'scrapy.extensions.telnet.TelnetConsole': None, 'scrapy.extensions.closespider.CloseSpider': 100, } CLOSESPIDER_ERRORCOUNT = 1 – scott huang Jun 09 '17 at 07:55
  • @scotthuang please update your question with what you tried and the error messages – parik Jun 09 '17 at 08:13
  • Hi parik, I found that it actually works! I tested it with a database exception. When I test this extension with an index-out-of-range exception, Scrapy stops. I will dig deeper to find out why. Thanks for your advice. – scott huang Jun 09 '17 at 08:56

In the process_item function of your pipeline, you have access to the spider instance.

To solve your problem, you could catch the exception when you insert your data, then cleanly stop your spider if you catch a certain exception, like this:

    def process_item(self, item, spider):
        try:
            # Insert your item into the database here
            pass
        except YourExceptionName:
            # Stop the whole crawl when this particular exception is raised
            spider.crawler.engine.close_spider(spider, reason='finished')
        return item
Adrien Blanquer

I don't know of a setting that would close the crawler on any exception, but you have at least a couple of options:

  • you can raise a CloseSpider exception in a spider callback, for example when you catch the exception you mention
  • you can call crawler.engine.close_spider(spider, 'some reason') if you have a reference to the crawler and spider objects, for example in an extension. See how the CloseSpider extension is implemented (it's not the same as the CloseSpider exception). You could hook this to the spider_error signal, for example; see the sketch below.
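A sketch of such an extension, assuming you want to close on the first callback error (the class name, log message and priority are made up; crawler.signals.connect, signals.spider_error and crawler.engine.close_spider are standard Scrapy APIs):

    from scrapy import signals

    class CloseOnErrorExtension(object):
        # Hypothetical extension: close the spider when any spider callback raises.

        def __init__(self, crawler):
            self.crawler = crawler
            # spider_error fires whenever a spider callback raises an exception.
            crawler.signals.connect(self.spider_error, signal=signals.spider_error)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def spider_error(self, failure, response, spider):
            spider.logger.error("Closing spider because of %s", failure.value)
            self.crawler.engine.close_spider(spider, 'callback error')

You would then enable it in settings.py with something like EXTENSIONS = {'myproject.extensions.CloseOnErrorExtension': 500} (the module path and priority are placeholders).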
paul trmbrth