I'm trying to create a custom pipeline for a Scrapy project that outputs the collected items to CSV files. To keep each file's size down, I want to set a maximum number of rows per file. Once the current file reaches that limit, a new file is created and output continues there.
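In other words, the behavior I'm after is roughly this standalone sketch (not the Scrapy code itself; the function name and parameters are just for illustration):

    import csv

    def write_partitioned(rows, max_rows=100, base_filename="site_{}.csv"):
        # Illustration only: write rows to a CSV file and, once max_rows
        # have been written, close it and continue in a new file.
        part = 0
        count = 0
        f = open(base_filename.format(part), "w", newline="")
        writer = csv.writer(f)
        for row in rows:
            if count and count % max_rows == 0:
                f.close()
                part += 1
                f = open(base_filename.format(part), "w", newline="")
                writer = csv.writer(f)
            writer.writerow(row)
            count += 1
        f.close()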
Luckily, I found a question where someone was looking to do the same thing, and one of its answers shows an example implementation.
I implemented that example, but tweaked the way the stats are accessed to align with the current version of Scrapy.
My current code
    from scrapy.exporters import CsvItemExporter
    import datetime


    class PartitionedCsvPipeline(object):

        def __init__(self, stats):
            self.stats = stats
            self.stats.set_value('item_scraped_count', 0)
            self.base_filename = "site_{}.csv"
            self.next_split = self.split_limit = 100
            self.create_exporter()

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.stats)

        def create_exporter(self):
            # Open a new timestamped file and start a fresh exporter for it.
            now = datetime.datetime.now()
            datetime_stamp = now.strftime("%Y%m%d%H%M")
            self.file = open(self.base_filename.format(datetime_stamp), 'w+b')
            self.exporter = CsvItemExporter(self.file)
            self.exporter.start_exporting()

        def process_item(self, item, spider):
            # Once the counter reaches the next split point, close the current
            # file and switch to a new one before exporting this item.
            if self.stats.get_value('item_scraped_count') >= self.next_split:
                self.next_split += self.split_limit
                self.exporter.finish_exporting()
                self.file.close()
                self.create_exporter()
            self.exporter.export_item(item)
            self.stats.inc_value('item_scraped_count')
            return item
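For completeness, the pipeline is enabled in settings.py in the usual way; the module path and priority value below are placeholders for my actual project:

    # settings.py (module path and priority are placeholders)
    ITEM_PIPELINES = {
        "myproject.pipelines.PartitionedCsvPipeline": 300,
    }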
The Problem
The pipeline does produce multiple output files, but each file contains only 50 items instead of the expected 100.
The Question
What am I doing wrong that makes the files half the expected size?