I'm trying to automate a file concatenation process in Python that works as effectively as the bash command-line process I've been using. The bash process uses awk to merge the files, and the Python I've tried so far uses pandas.
For example, let's say I have a directory containing multiple CSV files named part_0.csv, part_1.csv, ..., part_n.csv. Each file has a header in its first line. The bash CLI commands I use for this:
$ cd directory_containing_csv_files
$ mv part_0.csv merged.csv
$ awk 'FNR > 1' part*.csv >> merged.csv
The Python/pandas code below does the same thing, but croaks when the total size gets big:
import pandas as pd

# read all the CSVs into a single DataFrame (assumes the same schema for every file)
combined_df = pd.concat([pd.read_csv(pth, header=0) for pth in csv_paths])
# write the combined DataFrame out as a single CSV file
combined_df.to_csv(dest_path, index=False, header=True)
The Python code works well until it hits a size limit (the machine in question has 16 GB of RAM), presumably because pd.concat builds the entire combined DataFrame in memory before anything is written. The bash command-line processing hasn't failed yet, no matter how large the final file gets.
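One workaround I'm considering is streaming through pandas in chunks instead of concatenating everything at once. A rough, untested sketch (the 100,000-row chunk size is an arbitrary guess, and csv_paths / dest_path are the same variables as above):

import pandas as pd

# csv_paths and dest_path are the same variables used in the snippet above
with open(dest_path, "w", newline="") as out:
    first_chunk = True
    for pth in csv_paths:
        # read each part file in fixed-size chunks so only one chunk is held in memory at a time
        for chunk in pd.read_csv(pth, header=0, chunksize=100_000):
            chunk.to_csv(out, index=False, header=first_chunk)
            first_chunk = False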
Maybe there's another Python approach, one that doesn't use pandas at all, that is more memory efficient?
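For context, the direction I was imagining is a plain-stdlib version that mirrors the awk logic (keep the first file's header, drop the others, and stream everything else). A sketch of that idea, which assumes each part file ends with a trailing newline:

import glob
import shutil

csv_paths = sorted(glob.glob("part_*.csv"))  # lexical order, same as the shell glob
dest_path = "merged.csv"

with open(dest_path, "w", newline="") as out:
    for i, pth in enumerate(csv_paths):
        with open(pth, "r", newline="") as src:
            header = src.readline()
            if i == 0:
                out.write(header)         # keep the header from the first file only
            shutil.copyfileobj(src, out)  # stream the rest of the file without loading it into memory

Is something along these lines the right direction, or is there a more robust way to stream the files?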