
I'm trying to save some information between the last run of a spider and the current run. To make this possible I found the Stats Collection feature supported by Scrapy. My code is below:

from scrapy import Spider, Request


class StatsSpider(Spider):
    name = 'stats'

    def __init__(self, crawler, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.crawler = crawler
        print(self.crawler.stats.get_value('last_visited_url'))

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        return cls(crawler, *args, **kwargs)

    def start_requests(self):
        return [Request(url)
                for url in ['http://www.google.com', 'http://www.yahoo.com']]

    def parse(self, response):
        self.crawler.stats.set_value('last_visited_url', response.url)
        print('URL: %s' % response.url)

When I run my spider, I can see via the debugger that the stats variable is being refreshed with the new data; however, when I run my spider again (locally), the stats variable starts out empty. How should I properly run my spider in order to persist the data?

I'm running it from the console:

scrapy runspider stats.py

EDIT: If you are running it on Scrapinghub you can use their Collections API.

Rodrigo Ney
1 Answer


You need to save this data to disk in one way or another (to a file or a database).

The crawler object you're writing the data to only exists during the execution of your crawl. Once your spider finishes, that object is destroyed and your data is lost.

I suggest loading the stats from your last run in `__init__`, updating them in `parse` like you are now, and hooking up Scrapy's `spider_closed` signal to persist the data when the spider is done running.

If you need an example of `spider_closed`, let me know and I'll update. But plenty of examples are readily available on the web.

Edit: I'll just give you an example: https://stackoverflow.com/a/12394371/2368836

rocktheartsm4l
  • So, am I forced to create a file? And if I run the same code via Scrapinghub, would the variable keep the reference in memory? – Rodrigo Ney Aug 07 '15 at 16:24
  • With this approach you either need a file or a database, and I can't see any other way of doing it. Managing a local text file on a Scrapinghub server seems problematic, so you may want to open a remote database connection instead. But that also seems like overkill. I think Scrapinghub has support -- you might want to see if they have a suggestion. – rocktheartsm4l Aug 07 '15 at 16:30