Scrapy - Output to Multiple JSON files

Question

I am pretty new to Scrapy. I want to use it to crawl an entire website for links and output the items into multiple JSON files, so that I can then upload them to Amazon Cloud Search for indexing. Is it possible to split the items into multiple files instead of having just one giant file at the end? From what I've read, the Item Exporters can only output to one file per spider, and I am only using one CrawlSpider for this task. It would be nice if I could set a limit on the number of items included in each file, like 500 or 1000.

Here is the code I have set up so far (based on the Dmoz.org example used in the tutorial):

dmoz_spider.py

import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

items.py

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Thanks for the help.

score 4 · Accepted Answer · edited May 23 '17 at 12:09

I don't think built-in feed exporters support writing into multiple files.

One option would be to export into a single file in jsonlines format: basically, one JSON object per line, which is convenient to pipe and split.

Then, separately, after the crawling is done, you can read the file in the desired chunks and write into separate JSON files.
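As a minimal sketch of that post-processing step, assuming the crawl was exported as JSON Lines (for example with `scrapy crawl dmoz -o items.jl`): the function name `split_jsonlines`, the output directory, and the `items-0001.json` naming scheme are made up for illustration and are not part of Scrapy.

```python
import json
from itertools import islice
from pathlib import Path


def split_jsonlines(source, out_dir, chunk_size=500, prefix="items"):
    """Split a JSON Lines file into JSON array files of at most chunk_size items.

    Returns the list of paths that were written, in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(source, encoding="utf-8") as lines:
        # Lazily parse one JSON object per non-blank line.
        records = (json.loads(line) for line in lines if line.strip())
        part = 1
        while True:
            # Pull the next chunk_size records; an empty chunk means we are done.
            chunk = list(islice(records, chunk_size))
            if not chunk:
                break
            target = out_dir / f"{prefix}-{part:04d}.json"
            with open(target, "w", encoding="utf-8") as out:
                json.dump(chunk, out)
            written.append(target)
            part += 1
    return written
```

Calling `split_jsonlines("items.jl", "chunks", chunk_size=500)` after the crawl would then leave files such as `chunks/items-0001.json`, each holding a JSON array of up to 500 items. Because the input is read lazily, only one chunk is held in memory at a time.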

Regarding uploading the results to Amazon Cloud Search: note that there is a direct Amazon S3 exporter (not sure it helps, just FYI).
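For reference, a hedged sketch of what the S3 feed configuration might look like in `settings.py`. It assumes a Scrapy version that supports the `FEEDS` setting (2.1 or later; older versions used `FEED_URI` and `FEED_FORMAT` instead) and an installed AWS client library (botocore or boto3, depending on the Scrapy version). The bucket name, key path, and credential values are placeholders.

```python
# settings.py (sketch): the bucket and key path below are placeholders.
FEEDS = {
    "s3://my-bucket/crawls/items.jl": {"format": "jsonlines"},
}

# Placeholder credentials; replace with real values or supply them another way.
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY"
```

This writes the feed to S3 as a single JSON Lines object, so any splitting into smaller files would still be a separate step.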


answered Sep 30 '15 at 16:07

alecxe


I was also thinking along the lines of splitting it into separate JSON files after the crawling is complete. Sounds like the best option. Thanks for the suggestion. – liteshade06 Sep 30 '15 at 20:26
I also didn't even know there was an Amazon S3 exporter. I will definitely look into that as well. Thanks again! – liteshade06 Sep 30 '15 at 20:27

Jeroen Vermunt · Answer 2 · 2022-05-27T20:20:29.037

You can add a name to each item and use a custom pipeline to output to different JSON files, like so:

from scrapy.exporters import JsonItemExporter
from scrapy import signals

class MultiOutputExporter(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Hook the pipeline into the spider lifecycle signals
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # One output file and one exporter per item name
        self.items = ['item1', 'item2']
        self.files = {}
        self.exporters = {}

        for item in self.items:
            # JsonItemExporter writes bytes, so open the file in binary mode
            self.files[item] = open(f'{item}.json', 'w+b')
            self.exporters[item] = JsonItemExporter(self.files[item])
            self.exporters[item].start_exporting()

    def spider_closed(self, spider):
        # Finish each JSON array and close its file
        for item in self.items:
            self.exporters[item].finish_exporting()
            self.files[item].close()

    def process_item(self, item, spider):
        # Route the item to the exporter that matches its name
        self.exporters[item.name].export_item(item)
        return item

Then add names to your items as follows:

class Item(scrapy.Item):
    # Plain class attribute (not a scrapy.Field); the pipeline reads it as item.name
    name = 'item1'

Now enable the pipeline through the ITEM_PIPELINES setting in your project's settings.py and voila.
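Enabling the pipeline could look roughly like this. The module path `tutorial.pipelines` is an assumption based on the `tutorial` project name used in the question; adjust it to wherever the `MultiOutputExporter` class actually lives.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    # Hypothetical module path; the integer sets the pipeline's position
    # in the processing order (lower values run first).
    "tutorial.pipelines.MultiOutputExporter": 300,
}
```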
