52

I'm new to Scrapy, and it's an amazing crawler framework!

In my project, I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, and I can only see some statistics but no details.

2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1,
 'downloader/request_bytes': 46282582,
 'downloader/request_count': 92383,
 'downloader/request_method_count/GET': 92383,
 'downloader/response_bytes': 123766459,
 'downloader/response_count': 92382,
 'downloader/response_status_count/200': 92382,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000),
 'item_scraped_count': 46191,
 'request_depth_max': 1,
 'scheduler/memory_enqueued': 92383,
 'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}

Is there any way to get a more detailed report? For example, show those failed URLs. Thanks!

alecxe
Joe Wu

9 Answers

58

Yes, this is possible.

  • The code below adds a failed_urls list to a basic spider class and appends URLs to it if the response status of the URL is 404 (this would need to be extended to cover other error statuses as required).
  • Next, I added a handler that joins the list into a single string and adds it to the spider's stats when the spider is closed.
  • Based on your comments, it's also possible to track Twisted errors; some of the answers below give examples of how to handle that particular use case.
  • The code has been updated to work with Scrapy 1.8. All thanks for this should go to Juliano Mendieta, since all I did was add his suggested edits and confirm that the spider works as intended.

from scrapy import Spider, signals

class MySpider(Spider):
    handle_httpstatus_list = [404] 
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_spider_closed, signals.spider_closed)
        return spider

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, reason):
        self.crawler.stats.set_value('failed_urls', ', '.join(self.failed_urls))

    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

Example output (note that the downloader/exception_count* stats will only appear if exceptions are actually thrown - I simulated them by trying to run the spider after I'd turned off my wireless adapter):

2012-12-10 11:15:26+0000 [myspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 15,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 15,
     'downloader/request_bytes': 717,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 15209,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 2,
     'failed_url_count': 2,
     'failed_urls': 'http://www.example.com/thisurldoesnotexist.html, http://www.example.com/neitherdoesthisone.html',
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 874000),
     'log_count/DEBUG': 9,
     'log_count/ERROR': 2,
     'log_count/INFO': 4,
     'response_received_count': 3,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'spider_exceptions/NameError': 2,
     'start_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 560000)}
Talvalin
  • This no longer works: an `exceptions.NameError: global name 'self' is not defined` error occurs. `BaseSpider` is now just `Spider` (http://doc.scrapy.org/en/0.24/news.html?highlight=basespider#id2, https://github.com/scrapy/dirbot/blob/master/dirbot/spiders/dmoz.py), but I can't find the fix to get your code working yet @Talvalin. – Mikeumus May 25 '15 at 22:58
  • As noted above, the code was updated on 2020/01/01 to confirm that it works with Scrapy 1.8 – Talvalin Jul 22 '20 at 06:41
20

Here's another example of how to handle and collect 404 errors (checking the GitHub help pages):

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field


class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()


class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    handle_httpstatus_list = [404]
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item

Just run scrapy runspider with -o output.json and see the list of items in the output.json file.

alecxe
17

Scrapy ignores 404 responses by default and does not parse them. If you are getting error code 404 in the response, you can handle it in a very easy way.

In settings.py, write:

HTTPERROR_ALLOWED_CODES = [404, 403]

And then handle the response status code in your parse function:

def parse(self, response):
    if response.status == 404:
        # your action on error
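
For illustration, here's a minimal sketch of what that "action on error" could look like, assuming the HTTPERROR_ALLOWED_CODES setting above; the failed_url_count stats key and the logging call are just example choices, not part of the original answer:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ['http://www.example.com/thisurldoesnotexist.html']

    def parse(self, response):
        if response.status in (403, 404):
            # count the failure and log the offending URL
            self.crawler.stats.inc_value('failed_url_count')
            self.logger.error('Failed URL: %s', response.url)
            return
        # ... normal parsing of successful responses goes here ...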
daaawx
Pythonsguru
14

The answers from @Talvalin and @alecxe helped me a great deal, but they do not seem to capture downloader events that do not generate a response object (for instance, twisted.internet.error.TimeoutError and twisted.web.http.PotentialDataLoss). These errors show up in the stats dump at the end of the run, but without any meta info.

As I found out here, the errors are tracked by the stats.py downloader middleware, captured in the DownloaderStats class's process_exception method and specifically in the ex_class variable: the middleware increments a counter for each error type and then dumps the counts at the end of the run.

To match such errors with information from the corresponding request object, you can add a unique id to each request (via request.meta), then pull it into the process_exception method of stats.py:

self.stats.set_value('downloader/my_errs/{0}'.format(request.meta), ex_class)

That will generate a unique string for each downloader-based error not accompanied by a response. You can then save the altered stats.py as something else (e.g. my_stats.py), add it to the DOWNLOADER_MIDDLEWARES setting (with the right precedence), and disable the stock stats.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.my_stats.MyDownloaderStats': 850,
    'scrapy.downloadermiddleware.stats.DownloaderStats': None,
    }
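
For reference, here is a minimal sketch of what such a my_stats.py could look like. Instead of copying and editing Scrapy's stats.py, it subclasses the stock middleware; the import path assumes a reasonably recent Scrapy, and the req_id meta key is a made-up name for whatever unique id you attach to each request:

from scrapy.downloadermiddlewares.stats import DownloaderStats


class MyDownloaderStats(DownloaderStats):

    def process_exception(self, request, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__,
                              exception.__class__.__name__)
        # record which request failed, keyed by the unique id stored in request.meta
        self.stats.set_value(
            'downloader/my_errs/{0}'.format(request.meta.get('req_id')),
            ex_class, spider=spider)
        # keep the stock exception counters as well
        return super().process_exception(request, exception, spider)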

The output at the end of the run looks like this (here using meta info where each request url is mapped to a group_id and member_id separated by a slash, like '0/14'):

{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web.http.PotentialDataLoss': 3,
 'downloader/my_errs/0/1': 'twisted.web.http.PotentialDataLoss',
 'downloader/my_errs/0/38': 'twisted.web.http.PotentialDataLoss',
 'downloader/my_errs/0/86': 'twisted.web.http.PotentialDataLoss',
 'downloader/request_bytes': 47583,
 'downloader/request_count': 133,
 'downloader/request_method_count/GET': 133,
 'downloader/response_bytes': 3416996,
 'downloader/response_count': 130,
 'downloader/response_status_count/200': 95,
 'downloader/response_status_count/301': 24,
 'downloader/response_status_count/302': 8,
 'downloader/response_status_count/500': 3,
 'finish_reason': 'finished'....}

This answer deals with non-downloader-based errors.
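
For errors raised inside your own spider callbacks (rather than in the downloader), one option is the spider_error signal. A minimal sketch; the handler name and the callback_error_count stats key are my own choices, not Scrapy conventions:

from scrapy import Spider, signals


class MySpider(Spider):
    name = "myspider"
    start_urls = ['http://www.example.com/']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_spider_error, signals.spider_error)
        return spider

    def parse(self, response):
        raise ValueError('boom')  # any exception here triggers spider_error

    def handle_spider_error(self, failure, response, spider):
        # record the URL whose callback raised, plus the exception itself
        self.crawler.stats.inc_value('callback_error_count')
        self.logger.error('Callback failed for %s: %r', response.url, failure.value)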

scharfmn
  • Exactly what I'm looking for. I think Scrapy should add this feature to provide convenient access to failure info like URL. – wlnirvana Jan 06 '15 at 01:40
  • Use `scrapy.downloadermiddlewares.stats` instead of the deprecated `scrapy.contrib.downloadermiddleware.stats` on the latest (1.0.5) version – El Ruso Mar 08 '16 at 14:43
  • What a nice answer! But may I ask: if there are other types of errors, not just HTTP status code errors or downloader errors, are these all the errors we need to deal with for network failures? – Tarjintor Jul 22 '17 at 22:10
5

As of Scrapy 0.24.6, the method suggested by alecxe won't catch errors with the start URLs. To record errors with the start URLs you need to override parse_start_url. Adapting alecxe's answer for this purpose, you'd get:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field

class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()

class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    handle_httpstatus_list = [404]
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.handle_response(response)

    def parse_item(self, response):
        return self.handle_response(response)

    def handle_response(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item
Louis
5

This is an update on this question. I ran into a similar problem and needed to use the Scrapy signals to call a function in my pipeline. I have edited @Talvalin's code, but wanted to make an answer just for some more clarity.

Basically, you should add self as an argument for handle_spider_closed. You should also connect the dispatcher in __init__ so that you can pass the spider instance (self) to the handling method.

from scrapy.spider import Spider
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(Spider):
    handle_httpstatus_list = [404] 
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, category=None):
        self.failed_urls = []
        # the dispatcher is now called in init
        dispatcher.connect(self.handle_spider_closed,signals.spider_closed) 


    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, spider, reason): # added self 
        self.crawler.stats.set_value('failed_urls',','.join(spider.failed_urls))

    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__,  exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

I hope this helps anyone with the same problem in the future.

Mattias
4

You can capture failed URLs in two ways.

  1. Define the Scrapy request with an errback

    class TestSpider(scrapy.Spider):
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse, errback=self.errback)

        def errback(self, failure):
            '''handle failed url (failure.request.url)'''
            pass
    
  2. Use signals.request_dropped

    class TestSpider(scrapy.Spider):
        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # connect the handler to the request_dropped signal
            crawler.signals.connect(spider.request_dropped, signal=signals.request_dropped)
            return spider

        def request_dropped(self, request, spider):
            '''handle failed url (request.url)'''
            pass
    

[!Notice] A Scrapy request with an errback cannot catch some automatically retried failures, such as connection errors and the response codes listed in RETRY_HTTP_CODES in the settings.

jdxin0
  • How would you gracefully close the spider in those circumstances? – not2qubit Oct 11 '18 at 08:56
  • @not2qubit What's circumstances? – jdxin0 Oct 11 '18 at 13:32
  • There seem to be some funny stuff going on with *Twisted* so that I keep getting [this error](https://stackoverflow.com/questions/52757819/how-to-handle-connection-or-download-error-in-scrapy) even though I have already ordered the spider to shut down. So perhaps there is a better method to shut down the spider, before it retries, or even before that. – not2qubit Oct 11 '18 at 13:37
  • @not2qubit Check `self.crawler.crawling` in errback and request_dropped. If you shut down the spider, `self.crawler.crawling` will be `False`. – jdxin0 Oct 12 '18 at 04:14
3

In addition to some of these answers, if you want to track Twisted errors, I would take a look at using the Request object's errback parameter, on which you can set a callback function to be called with the Twisted Failure on a request failure. In addition to the url, this method can allow you to track the type of failure.

You can then log the URLs using failure.request.url (where failure is the Twisted Failure object passed into errback).

# these would be in a Spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse,
                                  errback=self.handle_error)

def handle_error(self, failure):
    url = failure.request.url
    logging.error('Failure type: %s, URL: %s', failure.type, url)

The Scrapy docs give a full example of how this can be done, except that the calls to the Scrapy logger are now deprecated, so I've adapted my example to use Python's built-in logging:

https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-errbacks

Michael
3

Basically, Scrapy ignores the 404 error by default; this is defined in the httperror middleware.

So, add HTTPERROR_ALLOW_ALL = True to your settings file.
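
For clarity, a minimal sketch of that settings.py change (this is the only setting needed for non-2xx responses to reach your callbacks):

# settings.py
# let every non-2xx response through to your spider callbacks
HTTPERROR_ALLOW_ALL = True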

After this you can access response.status through your parse function.

You can handle it like this.

def parse(self, response):
    if response.status == 404:
        print(response.status)
    else:
        # do something with the successful response
        pass
Mohan B E