Difficulty combining csv files into a single file

Question

My dataset covers flight delays and cancellations from 2009 to 2018. The important points:

Each year is its own CSV file: '2009.csv', '2010.csv', and so on through '2018.csv'.
Each file is roughly 700 MB.

I used the following to combine the CSV files:

import pandas as pd
import numpy as np
import os, sys
import glob

os.chdir('c:\\folder')

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

combined_airline_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

combined_airline_csv.to_csv('combined_airline_csv.csv', index =False, encoding = 'utf-8-sig')

When I run this, I receive the following message: MemoryError: Unable to allocate 43.3 MiB for an array with shape (5674621,) and data type float64

I am presuming that my combined data is too large and that I will need to run this on a virtual machine (e.g. on AWS).
Any thoughts?

Thank you!

Do you just want to get a merged file, or do you want to work with it via pandas? — dukkee, Jan 25 '21 at 20:49
A virtual machine might be an option, but remember that pandas stores data in memory. So you'll have to have a machine with a good amount of RAM. If you don't need all the data in each csv, you might try preprocessing the csv (perhaps with a line-by-line operation). Then you'll have a smaller data footprint to work with. — Docuemada, Jan 25 '21 at 20:53
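
A minimal sketch of the line-by-line preprocessing idea from the comment above, using only the standard-library csv module. The function name keep_columns and the column names in the usage comment are hypothetical; the real column names depend on the dataset. It assumes each file is UTF-8 encoded and has a header row.

```python
import csv


def keep_columns(input_path, output_path, columns):
    """Copy only the named columns from input_path to output_path, row by row.

    Only one row is held in memory at a time, so file size does not matter.
    """
    with open(input_path, newline="", encoding="utf-8") as src, \
         open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" drops every column not listed in `columns`.
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)


# Hypothetical usage (column names are placeholders, not taken from the dataset):
# keep_columns("2009.csv", "2009_small.csv", ["FL_DATE", "DEP_DELAY"])
```

The smaller files produced this way are more likely to fit in memory if they are later loaded with pandas.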

score 2 · Answer 1 · answered Jan 25 '21 at 21:05

This is a duplicate of how to merge 200 csv files in Python.

Since you just want to combine them into one file, there is no need to load all the data into a dataframe at the same time. Since the files all have the same structure, I would advise creating one file writer, then opening each file with a file reader and writing (or, if we want to be fancy, streaming) the data line by line. Just be careful not to copy the header each time, since you only want it once. Pandas is simply not the best tool for this task :)
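
The approach described above could look roughly like the sketch below. It is not taken from the linked question. The function name merge_csv_files is made up for illustration, and it assumes that every file is UTF-8 encoded, has the same columns in the same order, and has a header that fits on a single line.

```python
import glob
import os


def merge_csv_files(input_paths, output_path, encoding="utf-8"):
    """Stream every input CSV into output_path, writing the header only once."""
    output_abs = os.path.abspath(output_path)
    header = None
    with open(output_path, "w", encoding=encoding, newline="") as out:
        for path in input_paths:
            if os.path.abspath(path) == output_abs:
                continue  # never read the file that is being written
            with open(path, "r", encoding=encoding, newline="") as src:
                first_line = src.readline()
                if not first_line:
                    continue  # skip empty files
                if header is None:
                    # Keep the header of the first file only.
                    header = first_line
                    out.write(header if header.endswith("\n") else header + "\n")
                last = "\n"
                for line in src:  # one line in memory at a time
                    out.write(line)
                    last = line
                if not last.endswith("\n"):
                    # Keep the next file's first row from being glued to this one.
                    out.write("\n")


# Usage with the file layout from the question:
# os.chdir('c:\\folder')
# merge_csv_files(sorted(glob.glob('*.csv')), 'combined_airline_csv.csv')
```

Because only one line is held in memory at a time, the size of the individual files does not matter. The output file is skipped if it matches the glob pattern, so rerunning the script does not merge the previous result into itself.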

In general, this is a typical task that can also be done easily, and even faster, directly on the command line (the exact command depends on the OS).

Semmel

On Linux, 'cat *.csv > new.csv' would do the job. On Windows, 'type *.csv > new.csv'. – Arthur Harduim Jan 25 '21 at 21:36
@Semmel - thank you for the link to previous discussion thread. So it looks like the first answer (from Wisty) seemed to work. Is that the 'streaming' solution you were alluding to? Much appreciated :) – tlazas912 Jan 25 '21 at 21:43
