
I have a list of thousands of URLs which I scrape using a single Spider. Some URLs share the same domain. I want to count the number of Timeout errors per domain. If, for domain X, the number of Timeouts is higher than LIMIT, I want to avoid scraping all remaining URLs of this domain. Instead, I want to treat them as if they were normal Timeouts (without sending real requests) - my spider then handles Timeouts in its errback function.

So the point is that I shouldn't have to keep trying to scrape URLs from domain X once this domain has already raised 5 Timeouts. I just want to fake those Timeouts without actually sending requests.

I've created a middleware for this purpose, which seems like a good approach, but it has some bugs.

For example, if a URL should be automatically Timeouted (without a real request), it seems that Scrapy tries to repeat this request multiple times.

LOG:

2017-05-22 20:37:16 [engineapp.engine.scrapy.scrapytest.scrapytest.middlewares] INFO: Arbitrary Timeout! OCID: 12751
2017-05-22 20:37:16 [engineapp.engine.scrapy.scrapytest.scrapytest.middlewares] INFO: Arbitrary Timeout! OCID: 12751
2017-05-22 20:37:16 [engineapp.engine.scrapy.scrapytest.scrapytest.middlewares] INFO: Arbitrary Timeout! OCID: 12751
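
I wonder whether the repetition comes from Scrapy's built-in RetryMiddleware: as far as I can tell it retries requests whose download fails with twisted's TimeoutError/TCPTimedOutError, up to RETRY_TIMES extra attempts, which would be consistent with the three identical log lines above (though I'm not sure). A minimal sketch of the settings I mean, with their documented defaults:

# settings.py - built-in retry settings (these are the documented defaults)
RETRY_ENABLED = True
RETRY_TIMES = 2   # 2 retries => up to 3 attempts per request in total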

Maybe the problem is that I'm not returning anything from the process_request and process_exception functions.
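
To be explicit about what I mean by "returning anything", this is my reading of the documented downloader-middleware contract (just a bare sketch, not my real code):

class ContractSketchMiddleware:
    def process_request(self, request, spider):
        # Per the docs: return None, a Response, a Request, or raise IgnoreRequest.
        # Returning None means "continue processing this request normally".
        return None

    def process_exception(self, request, exception, spider):
        # Per the docs: return None, a Response, or a Request.
        # Returning None passes the exception on to other middlewares / the errback.
        return None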

This is the middleware:

import logging
from collections import defaultdict

from twisted.internet.error import TimeoutError, TCPTimedOutError

logger = logging.getLogger(__name__)

# Note: Site (domain parser) and settings (with TIMEOUT_COUNT_LIMIT) are
# project-specific helpers imported elsewhere in the real module.


class TimeoutProcessMiddleware:
    # Timeouts observed so far, keyed by domain.
    timeouts = defaultdict(int)

    def process_request(self, request, spider):
        logger.debug(self.timeouts)
        occ = request.meta['occ']
        domain = Site.parse_site_from_url(request.url)

        # If we already saw Timeouts for more than TIMEOUT_COUNT_LIMIT URLs of
        # the same domain, fake a Timeout instead of sending the request.
        if self.timeouts[domain] >= settings.TIMEOUT_COUNT_LIMIT:
            logger.info('Arbitrary Timeout! OCID: {}'.format(occ.id))
            raise TimeoutError()

    def process_exception(self, request, exception, spider):
        logger.debug('Processing exception: EXC: {}'.format(exception))
        if exception.__class__ in [TimeoutError, TCPTimedOutError]:
            domain = Site.parse_site_from_url(request.url)
            self.timeouts[domain] += 1
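
For completeness, this is roughly how I register it as a downloader middleware (a sketch; the dotted path below is a placeholder for wherever TimeoutProcessMiddleware actually lives in my project, and the priority value is arbitrary):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapytest.middlewares.TimeoutProcessMiddleware': 543,
}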

This is part of the Spider's errback:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import TimeoutError, TCPTimedOutError


class MainSpider(scrapy.Spider):
    ...

    def err(self, failure):
        if failure.check(HttpError):
            ....
        elif failure.check(TimeoutError, TCPTimedOutError):  # HERE I WANT TO PROCESS ALL TIMEOUTS
            self.process_timeout_error(occ)

Do you know what to do?

Milano