
Is it possible to make Scrapy write to CSV files with no more than 5000 rows in each? How can I give the files a custom naming scheme? Am I supposed to modify CsvItemExporter?

Crypto

2 Answers


Try this pipeline:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exporters import CsvItemExporter

import datetime

class MyPipeline(object):

    def __init__(self, stats):
        self.stats = stats
        self.base_filename = "result/amazon_{}.csv"
        self.next_split = self.split_limit = 50000 # assuming you want to split 50000 items/csv
        self.create_exporter()  

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def create_exporter(self):
        now = datetime.datetime.now()
        datetime_stamp = now.strftime("%Y%m%d%H%M")
        self.file = open(self.base_filename.format(datetime_stamp),'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()       

    def process_item(self, item, spider):
        if self.stats.get_value('item_scraped_count', 0) >= self.next_split:
            self.next_split += self.split_limit
            self.exporter.finish_exporting()
            self.file.close()
            self.create_exporter()
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # finish and close the last file when the spider stops
        self.exporter.finish_exporting()
        self.file.close()

Don't forget to add the pipeline to your settings:

ITEM_PIPELINES = {
   'myproject.pipelines.MyPipeline': 300,   
}
Aminah Nuraini
  • BTW: your example (with modifications) was used in the question [Scrapy custom pipeline outputting files half the size expected](https://stackoverflow.com/questions/62735616/scrapy-custom-pipeline-outputting-files-half-the-size-expected/), and I created minimal working code as a new example in my answer to that question. I think it can be useful for people who find your answer. – furas Jul 05 '20 at 19:12
  • @furas I need your help with scrapy! How can I reach you, bro? – αԋɱҽԃ αмєяιcαη Aug 21 '21 at 20:12
  • @αԋɱҽԃαмєяιcαη you can write to furas@tlen.pl, but I will be busy this week and without access to a computer. – furas Aug 24 '21 at 12:28
  • @furas https://stackoverflow.com/questions/68915740/scrapy-splash-how-to-deal-with-onclick – αԋɱҽԃ αмєяιcαη Aug 31 '21 at 04:02

Are you using Linux?

The split command is very useful for this case.

split -l 5000  -d --additional-suffix .csv items.csv items-

See split --help for the options. Note that split works purely on lines, so only the first output file will contain the CSV header row.

R. Max
  • Yes, I am. The site I'm scraping is huge, with millions of pages. I thought it might be better to do it from scrapy itself rather than running the split command from cron until the scraper finishes the job. – Crypto Jan 09 '14 at 07:56
  • @Crypto, in that case, you can subclass the `FeedExporter` class and modify the method `item_scraped` to keep a counter and reopen the exporter once it reaches the limit. This can be done by calling the methods `close_spider` and then `open_spider`. But you would need to take care of setting the filename and of correctly handling the deferred returned by `close_spider`. Adapting the exporter to your use case can be tricky, though; a simpler approach would be to create a pipeline that does what you need without subclassing anything. – R. Max Jan 09 '14 at 13:16
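
For reference, a rough sketch of the "simple pipeline" approach described in the comment above: a counter-based item pipeline that rotates the output file every 5000 items. The class name SplitCsvPipeline and the items_N.csv naming scheme are only illustrative assumptions, not taken from either answer.

from scrapy.exporters import CsvItemExporter


class SplitCsvPipeline(object):

    def __init__(self):
        self.items_per_file = 5000   # rows per CSV, per the question
        self.item_count = 0
        self.file_count = 0
        self.file = None
        self.exporter = None

    def open_spider(self, spider):
        self._open_new_file()

    def close_spider(self, spider):
        self._close_current_file()

    def _open_new_file(self):
        # open the next numbered CSV and start a fresh exporter for it
        self.file_count += 1
        self.file = open('items_{}.csv'.format(self.file_count), 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def _close_current_file(self):
        if self.exporter:
            self.exporter.finish_exporting()
            self.file.close()

    def process_item(self, item, spider):
        # rotate to a new file once the current one reaches the limit
        if self.item_count and self.item_count % self.items_per_file == 0:
            self._close_current_file()
            self._open_new_file()
        self.exporter.export_item(item)
        self.item_count += 1
        return item

Like the pipeline in the accepted answer, this would still need to be registered in ITEM_PIPELINES.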