I have multiple large CSV files, each between 1 GB and 7 GB in size.
None of the files has a header row; they are all in this format: ID,col1,col2,col3
File 1 has this structure.
1,23,22,27
2,67,29,22
3,34,34,23
File 2 has this structure.
4,23,22,27
5,67,29,22
6,34,34,23
i.e. the ID is unique across all files.
I would like to merge these files into a single CSV file and then sort the rows by the ID column. The resulting file will be around 75 GB, and it should start with ID 1:
1,23,22,27
2,67,29,22
3,34,34,23
4,23,22,27
5,67,29,22
6,34,34,23
At the moment I'm doing it like this:
import pandas as pd

CHUNK_SIZE = 10000000  # number of rows per chunk
output_file = 'combined.csv'

for csv_file_name in sorted_fnames:
    # the files have no header row, so don't let pandas treat the first row as one
    chunk_container = pd.read_csv(csv_file_name, header=None, chunksize=CHUNK_SIZE)
    print(csv_file_name)
    for chunk in chunk_container:
        # append each chunk without writing an index column or a header line
        chunk.to_csv(output_file, mode="a", header=False, index=False)
And then I'm sorting the file like this:
sort --parallel=2 -t, -k1,1 -n combined.csv > combined_sorted.csv
However, the merging step is incredibly slow; it takes more than an hour just to produce combined.csv.
Note that I have only 16 GB of RAM, which is why I'm reading the files in chunks.
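My guess is that most of that time goes into pandas parsing every row and then serializing it back out, even though I'm not transforming the data at all. I've been wondering whether a plain byte-level copy would do the same job. Here is a rough sketch of what I mean (untested; it assumes every input file ends with a trailing newline so rows from different files don't run together):

import shutil

output_file = 'combined.csv'

with open(output_file, 'wb') as out:
    for csv_file_name in sorted_fnames:  # same sorted_fnames as above
        print(csv_file_name)
        with open(csv_file_name, 'rb') as src:
            # stream the raw bytes across in fixed-size chunks; no CSV parsing at all,
            # so memory usage stays small regardless of file size
            shutil.copyfileobj(src, out)

The sort step afterwards would stay the same.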
Is there a faster way to do this?
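One other idea I have not tried yet: each input file already appears to be sorted by ID internally (as in the examples above), so maybe I could skip the 75 GB intermediate file and the external sort entirely and stream-merge the inputs with heapq.merge. A rough sketch, assuming every file really is sorted by ID and contains no blank lines:

import heapq

def rows(path):
    # lazily yield one line at a time from a file
    with open(path) as f:
        yield from f

streams = [rows(name) for name in sorted_fnames]
with open('combined_sorted.csv', 'w') as out:
    # merge the pre-sorted streams by the integer ID in the first column;
    # only one pending line per file is held in memory at any time
    out.writelines(heapq.merge(*streams, key=lambda line: int(line.split(',', 1)[0])))

If that sortedness assumption doesn't hold for the real files, I would obviously still need a proper sort.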
Thanks