
I am trying to save scraped items to separate JSON files, but I don't see any output files. The pipeline and the item are defined in the pipelines.py and items.py files in the Scrapy project folder. Do I have to call process_item() explicitly, or will it be called automatically when I return an item in scrape()? I enabled the pipeline in CrawlerProcess(settings={'ITEM_PIPELINES'}). Thanks.

The pipeline:

import datetime
import json

class JsonWriterPipeline(object):
    def process_item(self, item, spider):
        # Note: second-granularity timestamps mean items processed within
        # the same second overwrite each other's file.
        fileName = datetime.datetime.now().strftime("%Y%m%d%H%M%S") + '.json'
        try:
            with open(fileName, 'w') as fp:
                json.dump(dict(item), fp)
        except OSError:
            # Silently swallowing errors hides filesystem problems;
            # log the exception in real code.
            pass
        return item
import scrapy
from scrapy.spiders import CrawlSpider

class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

class mySpider(CrawlSpider):
    name = 'mySpider'
    allowed_domains = ['allowedDOmain.org']
    start_urls = ['https://url.org']

    def parse(self, response):
        monthSelector = '//div[@class="archives-column"]/ul/li/a[contains(text(),"November 2019")]/@href'
        monthLink = response.xpath(monthSelector).extract_first()
        yield response.follow(monthLink, callback=self.scrape)

    def scrape(self, response):
        # get the links to all individual articles
        linkSelector = '.entry-title a::attr(href)'
        allLinks = response.css(linkSelector).extract()

        for link in allLinks:
            item = ProjectItem()
            item['url'] = link
            request = response.follow(link, callback=self.getContent)
            request.meta['item'] = item
            # yield the request (not the bare item) so getContent runs
            # and yields the completed item
            yield request

        nextPageSelector = 'span.page-link a::attr(href)'
        nextPageLink = response.css(nextPageSelector).extract_first()
        yield response.follow(nextPageLink, callback=self.scrape)

    def getContent(self, response):
        item = response.meta['item']
        TITLE_SELECTOR = '.entry-title ::text'
        item['title'] = response.css(TITLE_SELECTOR).extract_first()
        yield item
Amartya Barua
  • In settings.py, have you added the JsonWriterPipeline class to ITEM_PIPELINES? – NFB Nov 16 '19 at 16:01
  • Yes, I did, but it didn't work. – Amartya Barua Nov 16 '19 at 16:16
  • Where are you calling the scrape function from? This is usually done inside a spider class. Can you post the whole class? – NFB Nov 16 '19 at 16:21
  • Yup, the scrape function is inside a spider class along with getContent and parse. – Amartya Barua Nov 16 '19 at 16:54
  • Added the spider class (the parse, getContent and scrape functions are indented properly in the source file). – Amartya Barua Nov 16 '19 at 17:09
  • Have you tried removing the try/except inside process_item to make sure the filesystem is not at issue in any way? If it is you won't get any indication currently. Also, what results are you getting currently, if any? Are you certain the spider is yielding items? – NFB Nov 16 '19 at 17:26
  • Turns out that if one tries to run a spider from inside a script, the settings need to be imported using the method described in the following: https://stackoverflow.com/questions/25170682/running-scrapy-from-script-not-including-pipeline – Amartya Barua Nov 16 '19 at 19:24

2 Answers


To settings.py, add:

ITEM_PIPELINES = {
        'myproject.pipelines.JsonWriterPipeline':100
}

where myproject is the name of your project/folder.

See the very last heading on this page : https://docs.scrapy.org/en/latest/topics/item-pipeline.html
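Once the pipeline is enabled, you don't call process_item() yourself; the engine calls it once for every item a spider yields. A minimal simulation of that contract in plain Python (FakePipeline and run_engine are illustrative stand-ins, not Scrapy APIs):

```python
# Sketch of Scrapy's item-pipeline contract, no Scrapy required.
class FakePipeline:
    def process_item(self, item, spider):
        # A pipeline receives each item and must return it
        # (or raise DropItem to discard it).
        item["processed"] = True
        return item

def run_engine(spider_items, pipeline):
    # What the engine does, in essence: call process_item per yielded item.
    return [pipeline.process_item(item, spider=None) for item in spider_items]

items = run_engine([{"url": "https://url.org/a"}], FakePipeline())
print(items[0]["processed"])  # True
```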

NFB

When running a spider from inside a script, the settings need to be imported using the method described in the following question: Running scrapy from script not including pipeline

Amartya Barua