I use Scrapy to crawl user pages on 'douban.com'. I have 20,000 users in my database, and I need to crawl each of these users' pages independently.
The problem is that the website sometimes blocks my crawler. If I notice immediately, I can manually shut the spider down with Ctrl+C, then restart it and keep going. Trying to automate this behaviour, I ran into a lot of problems. I have two ideas, shown below:
- Pause the spider inside Scrapy. Detect the 403 page, since it is the sign of being blocked, and add this code to the parse function:
if response.status == 403: reactor.callLater(0, lambda: time.sleep(60))
This does not work: sleep does not close the open connections, so no matter how long it sleeps, it is not the same as manually restarting the spider.
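Roughly, this first attempt looks like the sketch below (the spider name and the parse body are simplified, and handle_httpstatus_list is only there so the 403 responses reach parse at all):

```
import time

import scrapy
from twisted.internet import reactor


class UserSpider(scrapy.Spider):
    name = "douban_user"
    handle_httpstatus_list = [403]  # otherwise Scrapy filters out 403 responses
    start_urls = []                 # filled with the user page URLs from my database

    def parse(self, response):
        if response.status == 403:
            # blocked: try to "pause" by sleeping inside the reactor thread
            reactor.callLater(0, lambda: time.sleep(60))
            return
        # ... normal parsing of the user page ...
```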
- Split the start_urls and start spiders one by one. Since each start_url stands for one user, I split the start_urls list across several spiders and start them one after another from a script (http://doc.scrapy.org/en/0.24/topics/practices.html#run-from-script). Then I found out that the Twisted reactor CANNOT be restarted!
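My script follows the example from those docs, more or less like this (names are simplified, all_start_urls stands for the user URLs loaded from my database, and UserSpider is the spider from the sketch above):

```
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

# all_start_urls: the ~20,000 user page URLs loaded from my database
url_chunks = [all_start_urls[i:i + 1000]
              for i in range(0, len(all_start_urls), 1000)]


def crawl_chunk(urls):
    # one spider instance per chunk of user URLs (Scrapy 0.24 style)
    spider = UserSpider(start_urls=urls)
    crawler = Crawler(get_project_settings())
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # blocks until spider_closed fires and stops the reactor


for chunk in url_chunks:
    crawl_chunk(chunk)  # second call fails: ReactorNotRestartable
```

The first chunk crawls fine; the second call to reactor.run() raises twisted.internet.error.ReactorNotRestartable.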
So I have no idea how to completely pause Scrapy and then restart it automatically.