
I use Scrapy to crawl user pages on douban.com. I have 20,000 users in my database, and I need to crawl each of these users' pages independently.

The problem is that the website sometimes blocks my crawler. If I notice immediately, I can manually shut the spider down with Ctrl+C, restart it, and keep going. While trying to automate this behaviour I ran into a lot of problems. I have two ideas, shown below:

  1. Pause the spider inside Scrapy. Detect the 403 page, since it is the sign of being blocked, by adding this code in the parse function:
    if response.status == 403:
        reactor.callLater(0, lambda: time.sleep(60))

This does not work, because sleeping does not close the connections; no matter how long it sleeps, it is not the same as manually restarting the spider.
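
A non-blocking variant of this idea might look roughly like the sketch below (untested; `UserSpider` is a hypothetical name, it assumes `handle_httpstatus_list` is set so the 403 actually reaches `parse`, and it pauses the engine instead of sleeping in the reactor):

    import scrapy
    from twisted.internet import reactor

    class UserSpider(scrapy.Spider):        # hypothetical spider
        name = 'user_spider'
        handle_httpstatus_list = [403]      # let 403 responses reach parse()

        def parse(self, response):
            if response.status == 403:
                # stop scheduling new requests instead of blocking the reactor
                self.crawler.engine.pause()
                # resume after 60 seconds (an arbitrary delay)
                reactor.callLater(60, self.crawler.engine.unpause)
                return
            # ... normal parsing of the user page goes here ...

This still does not close the connections that are already open, so it is not a full restart, but at least it does not freeze Twisted the way time.sleep does.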

  2. Split the start_urls and start spiders one by one. Since one start URL stands for one user, I split this start_urls list and put the pieces into different spiders. Then I start the spiders from a script (http://doc.scrapy.org/en/0.24/topics/practices.html#run-from-script), and then I find out that the Twisted reactor CANNOT be restarted!
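
The closest workaround I can think of is to run each chunk of users in its own process, so every crawl gets a fresh reactor. A rough sketch (not verified; 'user_spider' and the chunk size are placeholders, and it assumes a Scrapy version with CrawlerProcess, run inside the project so the spider can be looked up by name):

    import multiprocessing

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_chunk(urls):
        # a fresh process means a fresh Twisted reactor for every crawl
        process = CrawlerProcess(get_project_settings())
        process.crawl('user_spider', start_urls=urls)  # placeholder spider name
        process.start()  # blocks until this crawl finishes

    def crawl_all(all_urls, chunk_size=100):
        for i in range(0, len(all_urls), chunk_size):
            p = multiprocessing.Process(target=run_chunk,
                                        args=(all_urls[i:i + chunk_size],))
            p.start()
            p.join()  # wait before starting the next chunk

Joining each process before starting the next keeps the crawls sequential, so only one spider hits the site at a time.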

So, apart from workarounds like these, I have no idea how to pause Scrapy completely and restart it automatically.

prehawk
  • Maybe this will help you: http://stackoverflow.com/a/9699317/4493674 and https://scrapy.readthedocs.org/en/latest/topics/exceptions.html?highlight=closeSpider – Cristian Olaru Feb 13 '15 at 03:26
  • @CristianOlaru In method no. 2, I do raise a CloseSpider exception, but the spider cannot be restarted either. – prehawk Feb 13 '15 at 03:32

1 Answer


You can use the errback of Scrapy requests, like this:

    return Request(url, callback=self.parse, errback=self.error_handler)

and define your error handler like this:

    def error_handler(self, failure):
        time.sleep(time_to_sleep)  # time in seconds
        # after the time expires, send the next request

The errback will be called for responses with a status other than 200, as well as for other download errors.
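
A slightly fuller version of this idea might look like the sketch below (untested; time_to_sleep is a placeholder, HttpError comes from Scrapy's errback documentation, and the time.sleep pause blocks the whole reactor, so every outstanding request is effectively paused):

    import time

    from scrapy import Request
    from scrapy.spidermiddlewares.httperror import HttpError  # module path in Scrapy 1.0+

    def error_handler(self, failure):
        # HttpError wraps responses with a non-2xx status, such as the 403 ban page
        if failure.check(HttpError):
            response = failure.value.response
            if response.status == 403:
                time.sleep(time_to_sleep)  # blocks the reactor, pausing all requests
                # retry the blocked URL; dont_filter bypasses the duplicate filter
                return Request(response.url,
                               callback=self.parse,
                               errback=self.error_handler,
                               dont_filter=True)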

Tasawer Nawaz