I have a working BaseSpider on Scrapy 0.20.0, and I'm trying to count the number of website URLs found and print it as INFO when the spider finishes (is closed). The problem is that I can't print this simple integer variable at the end of the session: any print statement in the parse() or parse_item() functions fires far too early, long before the crawl ends.

I also looked at this question, but it seems somewhat outdated and it is unclear how to use it properly, i.e. where to put it (myspider.py, pipelines.py, etc.).

Right now my spider-code is something like:

class MySpider(BaseSpider):
    ...
    foundWebsites = 0
    ...

    def parse(self, response):
        ...
        print "Found %d websites in this session.\n\n" % (self.foundWebsites)

    def parse_item(self, response):
        ...
        if item['website']:
            self.foundWebsites += 1
        ...

This obviously does not work as intended. Any better, simple ideas?

  • Why don't you use the stats extension? `self.crawler.stats.inc_value('found_websites')`. – Blender Jan 04 '14 at 19:11
  • Because I'm counting collected items (preferably unique) and not crawled sites. You could imagine, replacing "website" by "email" or some other field, that may not even be present. (That's why I want to count them!) I guess one should be able to use the `spider_closed` signal, but I have no idea how to use it. – not2qubit Jan 04 '14 at 19:20
  • What is stopping you from simply copying and pasting the code from the answer you referred to into your spider? – Guy Gavriely Jan 04 '14 at 19:51
  • I suppose a misinterpretation of the 2nd answer, thinking that the 1st answer was outdated. I just tried the 1st and darn it, it works! I apologize for excessive posting and I will add a comment there. – not2qubit Jan 04 '14 at 20:03

1 Answer

The first answer referred to works, and there is no need to add anything to pipelines.py. Just fold that answer into your spider code like this:

# To use the "spider_closed" signal we also need:
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(BaseSpider):
    ...
    foundWebsites = 0
    ...

    def __init__(self, *args, **kwargs):
        # Keep BaseSpider's own initialization intact.
        super(MySpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def parse(self, response):
        ...

    def parse_item(self, response):
        ...
        if item['website']:
            self.foundWebsites += 1
        ...

    def spider_closed(self, spider):
        # The signal fires for every spider; only report for this one.
        if spider is not self:
            return
        print "Found %d websites in this session.\n\n" % (self.foundWebsites)
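The mechanism this relies on can be sketched without Scrapy at all: a dispatcher maps signals to registered callbacks, and the engine fires the "closed" signal once the crawl is finished, which is why the counter is only read at the very end. Below is a minimal stdlib-only illustration of that pattern; the names `Dispatcher`, `SPIDER_CLOSED`, and `CountingSpider` are hypothetical stand-ins, not Scrapy's actual implementation:

```python
from collections import defaultdict

SPIDER_CLOSED = "spider_closed"  # illustrative signal token, not a Scrapy API

class Dispatcher(object):
    """Toy signal dispatcher mimicking dispatcher.connect()/send()."""
    def __init__(self):
        self._receivers = defaultdict(list)

    def connect(self, receiver, signal):
        # Register a callback to be run when `signal` is sent.
        self._receivers[signal].append(receiver)

    def send(self, signal, **kwargs):
        # Call every receiver registered for this signal.
        for receiver in self._receivers[signal]:
            receiver(**kwargs)

class CountingSpider(object):
    def __init__(self, dispatcher):
        self.foundWebsites = 0
        dispatcher.connect(self.spider_closed, SPIDER_CLOSED)

    def parse_item(self, item):
        # Count only items that actually carry the field of interest.
        if item.get("website"):
            self.foundWebsites += 1

    def spider_closed(self, spider):
        if spider is not self:
            return  # another spider's shutdown; ignore it
        print("Found %d websites in this session." % self.foundWebsites)

dispatcher = Dispatcher()
spider = CountingSpider(dispatcher)
for item in [{"website": "a.com"}, {}, {"website": "b.com"}]:
    spider.parse_item(item)
# The "engine" fires the signal once, after all parsing is done:
dispatcher.send(SPIDER_CLOSED, spider=spider)
```

The point is that the final print happens inside the signal callback, not inside the parse methods, so it runs exactly once, after every item has been processed.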