
I use Scrapy to crawl user pages on douban.com. I have 20,000 users in my database, and I need to crawl each of these users' pages independently.

The problem is that the website sometimes blocks my crawler. If I notice immediately, I can manually shut the spider down with Ctrl+C, restart it, and keep going. While trying to automate this behaviour I ran into a lot of problems. I have two ideas, shown below:

  1. Pause the spider inside Scrapy. Detect the 403 page, since it is the sign of being blocked, by adding this code in the parse function:
    if response.status == 403:
        reactor.callLater(0, lambda: time.sleep(60))

This does not work, because sleeping does not close the connections; no matter how long it sleeps, it is not the same as manually restarting the spider.
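
A non-blocking variant of this idea might look roughly like the sketch below (untested; `UserSpider` is a hypothetical name, it assumes `handle_httpstatus_list` is set so the 403 actually reaches `parse`, and it pauses the engine instead of sleeping in the reactor):

    import scrapy
    from twisted.internet import reactor

    class UserSpider(scrapy.Spider):        # hypothetical spider
        name = 'user_spider'
        handle_httpstatus_list = [403]      # let 403 responses reach parse()

        def parse(self, response):
            if response.status == 403:
                # stop scheduling new requests instead of blocking the reactor
                self.crawler.engine.pause()
                # resume after 60 seconds (an arbitrary delay)
                reactor.callLater(60, self.crawler.engine.unpause)
                return
            # ... normal parsing of the user page goes here ...

This still does not close the connections that are already open, so it is not a full restart, but at least it does not freeze Twisted the way time.sleep does.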

  2. Split the start_urls and start spiders one by one. Since one start URL stands for one user, I split this start_urls list and put the pieces into different spiders. Then I start the spiders from a script (http://doc.scrapy.org/en/0.24/topics/practices.html#run-from-script), and then I find out that the Twisted reactor CANNOT be restarted!
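
The closest workaround I can think of is to run each chunk of users in its own process, so every crawl gets a fresh reactor. A rough sketch (not verified; 'user_spider' and the chunk size are placeholders, and it assumes a Scrapy version with CrawlerProcess, run inside the project so the spider can be looked up by name):

    import multiprocessing

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_chunk(urls):
        # a fresh process means a fresh Twisted reactor for every crawl
        process = CrawlerProcess(get_project_settings())
        process.crawl('user_spider', start_urls=urls)  # placeholder spider name
        process.start()  # blocks until this crawl finishes

    def crawl_all(all_urls, chunk_size=100):
        for i in range(0, len(all_urls), chunk_size):
            p = multiprocessing.Process(target=run_chunk,
                                        args=(all_urls[i:i + chunk_size],))
            p.start()
            p.join()  # wait before starting the next chunk

Joining each process before starting the next keeps the crawls sequential, so only one spider hits the site at a time.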

So, apart from workarounds like these, I have no idea how to pause Scrapy completely and restart it automatically.

prehawk
  • Maybe this will help you: http://stackoverflow.com/a/9699317/4493674 and https://scrapy.readthedocs.org/en/latest/topics/exceptions.html?highlight=closeSpider – Cristian Olaru Feb 13 '15 at 03:26
  • @CristianOlaru In method no. 2, I do raise a CloseSpider exception, but the spider cannot be restarted either. – prehawk Feb 13 '15 at 03:32

1 Answer


You can use the errback of Scrapy requests, like this:

    return Request(url, callback=self.parse, errback=self.error_handler)

and define your error handler like this:

    def error_handler(self, failure):
        time.sleep(time_to_sleep)  # time in seconds
        # after the time expires, send the next request

The errback will be called for responses with a status other than 200, as well as for other download errors.
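
A slightly fuller version of this idea might look like the sketch below (untested; time_to_sleep is a placeholder, HttpError comes from Scrapy's errback documentation, and the time.sleep pause blocks the whole reactor, so every outstanding request is effectively paused):

    import time

    from scrapy import Request
    from scrapy.spidermiddlewares.httperror import HttpError  # module path in Scrapy 1.0+

    def error_handler(self, failure):
        # HttpError wraps responses with a non-2xx status, such as the 403 ban page
        if failure.check(HttpError):
            response = failure.value.response
            if response.status == 403:
                time.sleep(time_to_sleep)  # blocks the reactor, pausing all requests
                # retry the blocked URL; dont_filter bypasses the duplicate filter
                return Request(response.url,
                               callback=self.parse,
                               errback=self.error_handler,
                               dont_filter=True)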

Tasawer Nawaz