
I'm scraping articles on 100 companies, and I want to save the content from the articles to a separate CSV file for each company. I have the scraper and a CSV export pipeline built, and they work fine; however, the spider opens a new CSV file for each company (as it should) without closing the file it opened for the previous company.

The CSV files do close after the spider closes, but because of the amount of data I am scraping for each company, the files are large and keeping them all open puts a strain on my machine's memory. It also can't realistically scale: if I increase the number of companies (something I eventually want to do), I will eventually hit an error for having too many files open at a time. Below is my CSV exporter pipeline. I would like to find a way to close the CSV file for the current company before moving on to the next company within the same spider.

I guess, theoretically, I could open the file for each article, write the content to new rows, then close it and reopen it for the next article (sketched after the pipeline code below), but that would slow the spider down significantly. I'd like to keep the file open for a given company while the spider is still working through that company's articles, then close it when the spider moves on to the next company.

I'm sure there is a solution but I have not been able to figure one out. Would greatly appreciate help solving this.

from scrapy.exporters import CsvItemExporter


class PerTickerCsvExportPipeline:
    """Distribute items across multiple CSV files according to their 'ticker' field"""

    def open_spider(self, spider):
        self.ticker_to_exporter = {}

    def close_spider(self, spider):
        for exporter in self.ticker_to_exporter.values():
            exporter.finish_exporting()

    def _exporter_for_item(self, item):
        ticker = item['ticker']
        if ticker not in self.ticker_to_exporter:
            f = open('{}_article_content.csv'.format(ticker), 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.ticker_to_exporter[ticker] = exporter
        return self.ticker_to_exporter[ticker]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item
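
For reference, the per-article open/append/close alternative I mentioned above would look roughly like this inside the pipeline (the 'title' and 'content' fields are just placeholders for whatever the items actually contain, and the header row is left out for brevity):

import csv

def process_item(self, item, spider):
    # open, append one row, and close again for every single article (simple but slow)
    path = '{}_article_content.csv'.format(item['ticker'])
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow([item.get('title'), item.get('content')])
    return item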
1 Answer


The problem is probably that you keep all the ItemExporters and their files open until the spider closes. I suggest closing the CsvItemExporter and the corresponding file for the previous company before you open a new one.

def open_spider(self, spider):
    self.ticker_to_exporter = {}
    self.files = []

def close_exporters(self):
    # finish all currently open exporters, then forget them
    for exporter in self.ticker_to_exporter.values():
        exporter.finish_exporting()
    self.ticker_to_exporter.clear()

def close_files(self):
    # close the underlying file objects as well
    for f in self.files:
        f.close()
    self.files.clear()

def close_spider(self, spider):
    self.close_exporters()
    self.close_files()

def _exporter_for_item(self, item):
    ticker = item['ticker']
    if ticker not in self.ticker_to_exporter:
        # a new ticker showed up: close the previous exporter and file first
        self.close_exporters()
        self.close_files()
        # binary append mode: CsvItemExporter expects a bytes stream, and
        # appending keeps earlier rows if the same ticker comes around again
        f = open('{}_article_content.csv'.format(ticker), 'ab')
        self.files.append(f)
        exporter = CsvItemExporter(f)
        exporter.start_exporting()
        self.ticker_to_exporter[ticker] = exporter
    return self.ticker_to_exporter[ticker]
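
Note that finish_exporting() does not close the underlying file object, which is why the files are tracked separately in self.files and closed explicitly; only closing the file actually releases the file handle.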
  • I tried this, and the files still stay open while the spider is running. After that I tried adding the self.close_exporters() statement into the close_exporters() method and still ran into the same issue, unfortunately. – as_owl May 26 '20 at 19:31
  • @as_owl Perhaps you should close the files also. I extended my answer, please try if it works for you. – Patrick Klein May 26 '20 at 20:32
  • I believe I would need to append f to self.files in the self.files.append() statement, no? As of now the self.files.append() statement doesn't do anything... Testing this now and will let you know if it works or not! Thanks – as_owl May 26 '20 at 23:00
  • 1
    Patrick, the edits you made to the code, unfortunately, has a nasty side-effect, in that it only saves the most recently scraped article content to the csv, i.e. it overwrites the file with each iteration in the spider. I'm wondering if, instead of exporting the scraped content all at once, it may make since to use a csv writer to write new rows, however that may slow the spider down having to open and close the file and add new data to it for each iteration. – as_owl May 26 '20 at 23:17
  • @as_owl True, that probably has to do with [the way the files](https://stackoverflow.com/questions/4706499/how-do-you-append-to-a-file-in-python) are opened. Also, Scrapy does not return items in the order you might expect, since it does not process requests one by one but concurrently, so items from different categories come in unordered and files get reordered/overridden. And yes, you are right, you should append `f`. Don't know how I missed that, thanks for pointing it out! I've changed the open as well as the append part. – Patrick Klein May 26 '20 at 23:35
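
Putting the points from the comments together, here is a rough, untested sketch of one possible variant (keeping only one file open at a time is just one option, not the only way to do it): it closes both the exporter and its file whenever the ticker changes, reopens a ticker's file in binary append mode ('ab', which CsvItemExporter needs since it writes bytes) so earlier rows survive when that ticker comes back, and only writes the header row when the file is first created:

import os

from scrapy.exporters import CsvItemExporter


class PerTickerCsvExportPipeline:
    """One CSV per ticker, with at most one file open at any time."""

    def open_spider(self, spider):
        self.current_ticker = None
        self.exporter = None
        self.file = None

    def _close_current(self):
        # finish the current exporter and close its file, if one is open
        if self.exporter is not None:
            self.exporter.finish_exporting()
            self.file.close()
            self.exporter = None
            self.file = None
            self.current_ticker = None

    def close_spider(self, spider):
        self._close_current()

    def _exporter_for_item(self, item):
        ticker = item['ticker']
        if ticker != self.current_ticker:
            self._close_current()
            path = '{}_article_content.csv'.format(ticker)
            # write the header only if the file does not exist yet or is empty
            write_headers = not os.path.exists(path) or os.path.getsize(path) == 0
            # binary append mode: CsvItemExporter needs a bytes stream, and
            # appending preserves rows written earlier for this ticker
            self.file = open(path, 'ab')
            self.exporter = CsvItemExporter(self.file, include_headers_line=write_headers)
            self.exporter.start_exporting()
            self.current_ticker = ticker
        return self.exporter

    def process_item(self, item, spider):
        self._exporter_for_item(item).export_item(item)
        return item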