I'm trying to automate a file concatenation process in Python that works as effectively as the bash command-line process I've been using. The bash process uses awk to merge the files, and the Python I've tried so far uses pandas.
For example, let's say I have a directory containing multiple CSV files named part_0.csv, part_1.csv, ..., part_n.csv. Each file has a header in its first line. The bash CLI commands I use for this:
$ cd directory_containing_csv_files
$ mv part_0.csv merged.csv
$ awk 'FNR > 1' part*.csv >> merged.csv
The Python/pandas code below does the same thing, but croaks when the total size gets big:
import pandas as pd

# read all the CSVs into a single DataFrame (assumes the same schema for every file)
combined_df = pd.concat([pd.read_csv(pth, header=0) for pth in csv_paths])
# write the combined DataFrame out as a single CSV file
combined_df.to_csv(dest_path, index=False, header=True)
The Python code works well until it hits a size limit (the machine in question has 16 GB of RAM), presumably because pd.concat builds the entire combined DataFrame in memory before anything is written. The bash command-line processing hasn't failed yet, no matter how large the final file gets.
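One workaround I'm considering is streaming through pandas in chunks instead of concatenating everything at once. A rough, untested sketch (the 100,000-row chunk size is an arbitrary guess, and csv_paths / dest_path are the same variables as above):

import pandas as pd

# csv_paths and dest_path are the same variables used in the snippet above
with open(dest_path, "w", newline="") as out:
    first_chunk = True
    for pth in csv_paths:
        # read each part file in fixed-size chunks so only one chunk is held in memory at a time
        for chunk in pd.read_csv(pth, header=0, chunksize=100_000):
            chunk.to_csv(out, index=False, header=first_chunk)
            first_chunk = False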
Maybe there's another Python approach, one that doesn't use pandas at all, that is more memory efficient?
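For context, the direction I was imagining is a plain-stdlib version that mirrors the awk logic (keep the first file's header, drop the others, and stream everything else). A sketch of that idea, which assumes each part file ends with a trailing newline:

import glob
import shutil

csv_paths = sorted(glob.glob("part_*.csv"))  # lexical order, same as the shell glob
dest_path = "merged.csv"

with open(dest_path, "w", newline="") as out:
    for i, pth in enumerate(csv_paths):
        with open(pth, "r", newline="") as src:
            header = src.readline()
            if i == 0:
                out.write(header)         # keep the header from the first file only
            shutil.copyfileobj(src, out)  # stream the rest of the file without loading it into memory

Is something along these lines the right direction, or is there a more robust way to stream the files?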