Using Scrapy, I'm implementing a CrawlSpider which will scrape all kinds of websites, including some very slow ones that will eventually produce a timeout.
My problem is that when such a twisted.internet.error.TimeoutError occurs, I want to trigger the errback of my spider. I don't want to raise this exception, and I also don't want to return a dummy Response object, which might suggest that scraping was successful.
Note that I have already managed to make all of this work, but only with a "dirty" workaround:

myspider.py (excerpt)

import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

log = logging.getLogger(__name__)


class MySpider(CrawlSpider):
    name = 'my-spider'

    rules = (
        Rule(
            link_extractor=LinkExtractor(unique=True),
            callback='_my_callback', follow=True
        ),
    )

    def parse_start_url(self, response):
        # (...)

    def errback(self, failure):
        log.warning('Failed scraping following link: {}'
            .format(failure.request.url))

middlewares.py (excerpt)

from scrapy import signals
from scrapy.http import Response
from twisted.internet.error import DNSLookupError, TimeoutError

# (...)

class MyDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your middlewares.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        # just 2 examples of errors I want to catch
        if isinstance(exception, (TimeoutError, DNSLookupError)):
            # set status=500 to enforce errback() call
            return Response(request.url, status=500)

The settings should be fine, with my custom middleware already enabled.
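
For reference, a rough sketch of how that looks in settings.py (the myproject.middlewares path is just a placeholder for my actual module path):

settings.py (excerpt)

DOWNLOADER_MIDDLEWARES = {
    # placeholder path, adjust to the real project layout
    'myproject.middlewares.MyDownloaderMiddleware': 543,
}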

Now as you can see, by using return Response(request.url, status=500) I can trigger my errback() function in MySpider as desired. However, the status code 500 is very misleading because it's not only incorrect, but technically I never receive any response at all.

So my question is: how can I trigger my errback() function through DownloaderMiddleware.process_exception() in a clean way?

EDIT: I quickly figured out that I want the same behaviour for similar exceptions like DNSLookupError. I've updated the code snippets accordingly.

nichoio

2 Answers


I didn't find it in the docs, but looking at the source I found that DownloaderMiddleware.process_exception() can return twisted.python.failure.Failure objects as well as Request or Response objects.

This means you can wrap the exception in a Failure object and return it, so that it gets handled by the errback.

This is cleaner than creating a fake Response object; see an example middleware implementation that does this here: https://github.com/miguelsimon/site2graph/blob/master/site2graph/middlewares.py

The core idea:

from twisted.python.failure import Failure


class MyDownloaderMiddleware:

    def process_exception(self, request, exception, spider):
        # Wrap the exception in a Failure so it is passed on to the errback
        return Failure(exception)
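
In the spider, the errback then receives this Failure like any other download error. A minimal sketch of what handling it could look like, reusing the errback from the question (and assuming the same DNSLookupError/TimeoutError imports):

def errback(self, failure):
    # failure.request is the Request that triggered the error
    if failure.check(TimeoutError, DNSLookupError):
        self.logger.warning('Failed scraping following link: {}'
            .format(failure.request.url))
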
miguel

The __init__ method of the Rule class accepts a process_request parameter that you can use to attach an errback to a request:

class MySpider(CrawlSpider):
    name = 'my-spider'

    rules = (
        Rule(
            # …
            process_request='process_request',
        ),
    )

    def process_request(self, request, response):
        return request.replace(errback=self.errback)

    def errback(self, failure):
        pass
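
Because the errback is attached directly to each request, download errors such as TimeoutError or DNSLookupError are delivered to it as a twisted.python.failure.Failure, without the need for a custom downloader middleware.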
Gallaecio