I'm attempting to scrape articles on 100 companies, and I want to save the content from the multiple articles to a separate csv file for each company. I have the scraper and a csv export pipeline built, and it works fine, however, the spider opens a new csv file for each company (as it should) without closing the file opened for the previous company.
The csv files close after the spider closes, but because of the amount of data I am scraping for each company, the file sizes are significant and causes a strain on my machines memory, and cannot realistically scale, given that if I wanted to increase the number of companies (something I eventually want to do), I will, eventually run into an error for having too many files open at a time. Below is my csv exporter pipeline. I would like to find a way to close one csv file for the current company before moving on to the next company within the same spider:
I guess, theoretically, I could open the file for each article, write the content to new rows, then close it and reopen it again for the next article, but that will slow the spider down significantly. I'd like to keep the file open for a given company while the spider is still making its way through that company's articles, then close it when the spider moves on to the next company.
I'm sure there is a solution but I have not been able to figure one out. Would greatly appreciate help solving this.
class PerTickerCsvExportPipeline:
"""Distribute items across multiple CSV files according to their 'ticker' field"""
def open_spider(self, spider):
self.ticker_to_exporter = {}
def close_spider(self, spider):
for exporter in self.ticker_to_exporter.values():
exporter.finish_exporting()
def _exporter_for_item(self, item):
ticker = item['ticker']
if ticker not in self.ticker_to_exporter:
f = open('{}_article_content.csv'.format(ticker), 'wb')
exporter = CsvItemExporter(f)
exporter.start_exporting()
self.ticker_to_exporter[ticker] = exporter
return self.ticker_to_exporter[ticker]
def process_item(self, item, spider):
exporter = self._exporter_for_item(item)
exporter.export_item(item)
return item