
I am using Scrapy to scrape a site.

I have written a spider that fetches all the items from a page and saves them to a CSV file. Now I want to save the total execution time Scrapy took to run the spider. After the spider finishes, the terminal displays some results such as start time, end time, and so on. So in my program I need to calculate the total time Scrapy took to run the spider and store that time somewhere.

Can anyone show me how to do this with an example?

Thanks in advance.

Shiva Krishna Bavandla

3 Answers


This could be useful:

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.stats import stats
from datetime import datetime

def handle_spider_closed(spider, reason):
    print 'Spider closed:', spider.name, stats.get_stats(spider)
    print 'Work time:', datetime.now() - stats.get_stats(spider)['start_time']


dispatcher.connect(handle_spider_closed, signals.spider_closed)
warvariuc
  • This could be helpful to me because I'm facing the same problem, but it's not clear to me where to put this code. Could you please give me a hint? – Max Jan 21 '14 at 23:11
  • You can put this code in any module, but must ensure the module is imported during spider startup – warvariuc Jan 22 '14 at 06:29
  • @warwaruk Why didn't you use `stats.get_stats(spider)['finish_time']` instead of `datetime.now()`? Wouldn't that be more accurate? – William Kinaan Apr 21 '14 at 15:41
  • 1
    The answer was given in Jun 2012. Maybe there wasn't such key at that time. Feel free to post your own answer. – warvariuc Apr 21 '14 at 15:45

The easiest way I've found so far:

import scrapy

class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"

    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        for title in response.css(".summary .question-hyperlink::text").getall():
            yield {"Title":title}

    def close(self, reason):
        start_time = self.crawler.stats.get_value('start_time')
        finish_time = self.crawler.stats.get_value('finish_time')
        print("Total run time:", finish_time - start_time)
SIM

I'm quite a beginner, but I did it with a slightly simpler method, and I hope it makes sense.

import datetime

Then declare two instance attributes, self.starting_time and self.ending_time.

Inside the spider class's constructor, record the starting time:

def __init__(self, name=None, **kwargs):
    super().__init__(name, **kwargs)
    self.starting_time = datetime.datetime.now()

After that, use the closed method to find the difference between the ending and starting times, i.e.

def closed(self, reason):
    self.ending_time = datetime.datetime.now()
    duration = self.ending_time - self.starting_time
    print(duration)

That's pretty much it. The closed method is called right after the spider finishes. See the documentation here.

Erick Kondela
  • I was going to post this answer as this is a cleaner solution in the newer version of scrapy. – Upendra Jan 03 '20 at 12:10
  • Btw, you don't need another variable for the end time. You can take a shortcut by writing datetime.datetime.now() - self.starting_time – Upendra Jan 03 '20 at 12:18