
How can I collect stats from within a spider callback?

Example

class MySpider(Spider):
    name = "myspider"
    start_urls = ["http://example.com"]

    def parse(self, response):
        stats.set_value('foo', 'bar')

Not sure what to import or how to make stats available in general.

mattes

5 Answers

17

Check out the stats page from the Scrapy documentation. The documentation states that the Stats Collector is available, but it may be necessary to add `from scrapy.stats import stats` to your spider code to be able to do stuff with it.

EDIT: At the risk of blowing my own trumpet, if you were after a concrete example, I posted an answer about how to collect failed URLs.

EDIT2: After a lot of googling, apparently no imports are necessary. Just use `self.crawler.stats.set_value()`!
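
A minimal sketch of what that looks like in a callback (the stat names 'foo' and 'pages_crawled' are just illustrative):

from scrapy import Spider

class MySpider(Spider):
    name = "myspider"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # self.crawler is attached to every spider once the crawl starts,
        # so the stats collector needs no extra imports
        self.crawler.stats.set_value('foo', 'bar')
        self.crawler.stats.inc_value('pages_crawled')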

Talvalin
  • hmm. it returns `ImportError: cannot import name crawler`. `File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/stats.py", line 1, in from scrapy.project import crawler` – mattes Apr 10 '14 at 02:02
  • That's odd. I take it that your basic spider works without error? – Talvalin Apr 10 '14 at 07:25
  • yep. it works as long as I don't do anything with `stats`. Here is an example of what my spider looks like: https://gist.github.com/mattes/10367042 – mattes Apr 10 '14 at 10:44
  • I've edited my answer above. You can just use `self.crawler.stats.set_value()` in the `parse` method. – Talvalin Apr 10 '14 at 11:12
  • How do you reference the stats that were collected in a crawl? – michaelAdam Jul 16 '15 at 20:45
3

With Scrapy 0.24, I use stats in the following way:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector

class TopSearchesSpider(CrawlSpider):
    name = "topSearches"
    allowed_domains = ["...domain..."]

    start_urls = (
        'http://...domain...',
    )

    def __init__(self, stats):
        super(TopSearchesSpider, self).__init__()
        self.stats = stats  # keep a reference to the crawler's stats collector

    @classmethod
    def from_crawler(cls, crawler):
        # the crawler (and its stats collector) is only available here
        return cls(crawler.stats)

    def parse_start_url(self, response):
        sel = Selector(response)
        url = response.url

        self.stats.inc_value('pages_crawled')
        ...

The super() call invokes the CrawlSpider constructor, so its own initialization code still runs.

Franzi
2

Add this inside your spider class:

def my_parse(self, response): 
    print self.crawler.stats.get_stats()
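
This returns the stats as a plain dict of Scrapy's built-in counters; a truncated sketch of what the output typically looks like (the values here are made up):

{'downloader/request_count': 18,
 'downloader/response_count': 18,
 'item_scraped_count': 16,
 'response_received_count': 18}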

Aminah Nuraini
1

If you want to use the stats in other components, outside the spider, you can:

spider.crawler.stats.get_stats()
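
For example, a minimal sketch of reading stats from an item pipeline; the pipeline class and log message are hypothetical, but any component that receives the spider can reach the stats collector this way:

class StatsLoggingPipeline(object):
    # hypothetical pipeline: reach the stats collector through spider.crawler
    def process_item(self, item, spider):
        count = spider.crawler.stats.get_value('item_scraped_count', 0)
        spider.log("items scraped so far: %s" % count)
        return item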

0

If you want to get the Scrapy stats after crawling as a Python object, this might help:

from pydispatch import dispatcher  # on older Scrapy: from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.crawler import CrawlerProcess

def spider_results(spider):
    results = []
    stats = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    def crawler_stats(*args, **kwargs):
        # the sender of spider_closed exposes the stats collector
        stats.append(kwargs['sender'].stats.get_stats())

    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    dispatcher.connect(crawler_stats, signal=signals.spider_closed)

    process = CrawlerProcess()
    process.crawl(spider)  # put our own spider class here
    process.start()  # the script will block here until the crawling is finished
    return results, stats
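
A hypothetical call, assuming a spider class named MySpider is defined in the same script:

results, stats = spider_results(MySpider)
print(len(results))  # number of scraped items
print(stats[0])      # stats dict captured at spider_closed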

Hope it helps!

sid10on10