I'm trying to save some information between the previous run of my spider and the current one. To make this possible, I found the Stats Collection facility supported by Scrapy. My code is below:
from scrapy import Request, Spider


class StatsSpider(Spider):
    name = 'stats'

    def __init__(self, crawler, *args, **kwargs):
        Spider.__init__(self, *args, **kwargs)
        self.crawler = crawler
        # Try to read the value stored by the previous run
        print(self.crawler.stats.get_value('last_visited_url'))

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_requests(self):
        return [Request(url)
                for url in ['http://www.google.com', 'http://www.yahoo.com']]

    def parse(self, response):
        # Record the most recently visited URL in the stats collector
        self.crawler.stats.set_value('last_visited_url', response.url)
        print('URL: %s' % response.url)
When I run my spider, I can see in the debugger that the stats are updated with the new data. However, when I run the spider again (locally), the stats start out empty. How should I run my spider so that the data persists between runs?
I'm running it from the console:
scrapy runspider stats.py
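To make clear what I mean by persisting, here is a minimal sketch of the behaviour I'm after, using a plain JSON file instead of the stats collector. The helper names load_state and save_state and the file path are my own placeholders, not part of Scrapy; this is only an illustration under those assumptions, not something Scrapy does for me.

```python
import json
import os


def load_state(path):
    """Return the dict saved by a previous run, or {} if there is none."""
    # Hypothetical helper: not a Scrapy API.
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_state(path, state):
    """Write the dict to disk so the next run can read it back."""
    # Hypothetical helper: not a Scrapy API.
    with open(path, 'w') as f:
        json.dump(state, f)
```

I imagine calling load_state in the spider's __init__ and save_state when the spider finishes (for example from a closed(reason) method on the spider), so that 'last_visited_url' from one run would be available at the start of the next.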
EDIT: If you are running it on Scrapinghub, you can use their Collections API.