
I am trying to record the end time of my crawler. When Scrapy finishes, it dumps the crawl stats to the log. I have been trying to record these stats from a pipeline using the close_spider method.

I have the stats as a field of my pipeline (self.stats).

def close_spider(self, spider):
    record_crawl_stats(self.stats)

The problem is that the 'finish_time' isn't available when this is called.

I am trying to find a way of getting a hold of the same stats as the ones that are dumped at the end.

(I could just use datetime.now() for the finish time, but there are other stats I want access to that also are not available yet, such as the finish reason and, I believe, the number of items created.)
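A minimal sketch of the situation, assuming a pipeline shaped like the one described above (the class name is made up; 'start_time', 'finish_time' and 'finish_reason' are standard Scrapy stat keys):

```python
# Hypothetical pipeline illustrating the timing problem: by the time
# close_spider runs, 'start_time' has been set, but 'finish_time' and
# 'finish_reason' have not.
class CrawlStatsPipeline:
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def close_spider(self, spider):
        # Set when the crawl began, so this is available here:
        start = self.stats.get_value('start_time')
        # Not yet set when close_spider runs -- both come back None:
        finish = self.stats.get_value('finish_time')
        reason = self.stats.get_value('finish_reason')
        return start, finish, reason
```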

Through research I found some answers to similar questions, where the spider-closing is handled elsewhere than in the pipeline. However, the code in both of them is not compatible with the current version of Scrapy for various reasons.

https://stackoverflow.com/a/13799984/5078308

https://stackoverflow.com/a/11246025/5078308

Does anyone have any idea how to get similar functionality updated for the latest version or a different way of solving this?

Mattias
  • Did you initialize your pipeline like the example here? http://doc.scrapy.org/en/latest/topics/stats.html#topics-stats-usecases – GHajba Jul 06 '15 at 05:24
  • @GHajba yes, I did. I have access to the stats object. It's a timing issue, I can access values that are set earlier, such as start time, but when close spider is called finish time and various other values have not been set yet. – Mattias Jul 06 '15 at 08:17

1 Answer


I found the answer by modifying some of the answers I linked to.

So basically, to get the finish time, the stats need to be read after the spider has fully closed, and that hasn't happened yet when the close_spider pipeline method runs. This is why you need to use the spider_closed signal that Scrapy sends instead. This is all my code that deals with this scenario.

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals


def __init__(self, stats, settings):
    self.stats = stats
    dispatcher.connect(self.save_crawl_stats, signals.spider_closed)

@classmethod
def from_crawler(cls, crawler):
    return cls(crawler.stats, crawler.settings)

def save_crawl_stats(self):
    record_crawl_stats(self.cur, self.stats, self.crawl_instance)
Mattias