I am using the following to check for (internet) connection errors in my spider.py:
# Imports used by this snippet (at the top of spider.py):
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

def handle_error(self, failure):
    if failure.check(DNSLookupError):   # or failure.check(UnknownHostError):
        request = failure.request
        self.logger.error('DNSLookupError on: %s', request.url)
        print("\nDNS Error! Please check your internet connection!\n")

    elif failure.check(HttpError):
        response = failure.value.response
        self.logger.error('HttpError on: %s', response.url)

    print('\nSpider closed because of Connection issues!\n')
    raise CloseSpider('Because of Connection issues!')
...
However, when the spider runs and the connection is down, I still get Traceback (most recent call last): messages. I would like to get rid of these by handling the error and shutting down the spider properly.
The output I get is:
2018-10-11 12:52:15 [NewAds] ERROR: DNSLookupError on: https://x.com
DNS Error! Please check your internet connection!
2018-10-11 12:52:15 [scrapy.core.scraper] ERROR: Error downloading <GET https://x.com>
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3.6/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3.6/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: x.com.
From this you can notice the following:

- I am able to partially handle the (first?) DNSLookupError error, but...
- shutting down the spider does not seem to be fast enough, so the spider continues trying to download the URL, causing a different error (ERROR: Error downloading) (a possible faster shutdown is sketched after this list),
- possibly causing a 2nd error: twisted.internet.error.DNSLookupError?
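One thing I have been considering for the slow shutdown is asking the engine to close the spider directly from the errback, instead of only raising CloseSpider. This is just a rough sketch of that idea, assuming self.crawler.engine.close_spider() can be called from inside the spider; I am not sure whether it also suppresses the scraper's "Error downloading" log:

def handle_error(self, failure):
    if failure.check(DNSLookupError):
        request = failure.request
        self.logger.error('DNSLookupError on: %s', request.url)
        # Ask the engine to close the spider right away instead of
        # waiting for a raised CloseSpider to propagate. (Assumption:
        # this shuts the crawl down sooner; the reason string is mine.)
        self.crawler.engine.close_spider(self, 'DNS lookup failed - check your internet connection')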
How can I handle the [scrapy.core.scraper] ERROR: Error downloading and make sure the spider gets shut down properly?
(Or: How can I check the internet connection on spider startup?)
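To make the alternative concrete, here is a minimal sketch of what I mean by checking the connection on startup: a plain DNS lookup with the standard library before any request is yielded. The hostname and the log message are just placeholders; if the lookup fails I simply schedule nothing so the spider finishes immediately.

import socket
import scrapy

def start_requests(self):
    # Rough connectivity check before scheduling anything: try to resolve
    # a well-known hostname ('www.google.com' is a placeholder); if that
    # fails, assume the connection is down and yield no requests.
    try:
        socket.gethostbyname('www.google.com')
    except socket.gaierror:
        self.logger.error('No internet connection detected on startup!')
        return
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)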