Using Scrapy, I'm implementing a CrawlSpider that scrapes all kinds of websites, including some very slow ones that eventually time out.
My problem is that when such a twisted.internet.error.TimeoutError occurs, I want to trigger my spider's errback. I don't want to raise this exception, and I also don't want to return a dummy Response object, which would suggest that scraping was successful.
Note that I already got all of this working, but only with a "dirty" workaround:
myspider.py (excerpt)

    class MySpider(CrawlSpider):
        name = 'my-spider'
        rules = (
            Rule(
                link_extractor=LinkExtractor(unique=True),
                callback='_my_callback', follow=True
            ),
        )

        def parse_start_url(self, response):
            # (...)

        def errback(self, failure):
            log.warning('Failed scraping following link: {}'
                        .format(failure.request.url))
middlewares.py (excerpt)

    from scrapy import signals
    from scrapy.http import Response
    from twisted.internet.error import DNSLookupError, TimeoutError
    # (...)

    class MyDownloaderMiddleware(object):

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_request(self, request, spider):
            return None

        def process_response(self, request, response, spider):
            return response

        def process_exception(self, request, exception, spider):
            # just 2 examples of errors I want to catch
            if isinstance(exception, (TimeoutError, DNSLookupError)):
                # set status=500 to enforce errback() call
                return Response(request.url, status=500)
Settings should be fine with my custom Middleware already enabled.
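For completeness, enabling the middleware in settings.py would look roughly like the sketch below. The module path `myproject.middlewares` and the priority `543` are assumptions for illustration (543 is the value Scrapy's project template suggests), not copied from the actual project.

```python
# settings.py (sketch)
# Assumption: the project package is called "myproject"; adjust the dotted
# path to wherever MyDownloaderMiddleware actually lives.
DOWNLOADER_MIDDLEWARES = {
    # Lower numbers sit closer to the engine, higher numbers closer to the
    # downloader; 543 is the placeholder priority from Scrapy's template.
    "myproject.middlewares.MyDownloaderMiddleware": 543,
}
```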
As you can see, returning Response(request.url, status=500) triggers the errback() function in MySpider as desired. However, the status code 500 is misleading: it is not only incorrect, but technically I never receive any response at all.
So my question is: how can I trigger my errback() function through DownloaderMiddleware.process_exception() in a clean way?
EDIT: I quickly realized that I want the same behaviour for similar exceptions such as DNSLookupError. I've updated the code snippets accordingly.