
I have multiple large CSV files, each 1 GB to 7 GB in size.

None of my files has a header row. They are all in this format: ID,col1,col2,col3

File 1 has this structure:

1,23,22,27
2,67,29,22
3,34,34,23

File 2 has this structure:

4,23,22,27
5,67,29,22
6,34,34,23

i.e. the ID is unique across all files.

I would like to merge these files into a single CSV file and then sort the rows based on the ID column. The resulting file will be around 75 GB, and it starts with ID 1.

1,23,22,27
2,67,29,22
3,34,34,23
4,23,22,27
5,67,29,22
6,34,34,23

At the moment I'm doing it like this:

import pandas as pd

CHUNK_SIZE = 10000000 # Number of Rows
output_file = 'combined.csv'

# sorted_fnames: the list of input CSV paths, in ID order
for csv_file_name in sorted_fnames:
    # header=None: the files have no header row to parse
    chunk_container = pd.read_csv(csv_file_name, chunksize=CHUNK_SIZE, header=None)
    print(csv_file_name)
    for chunk in chunk_container:
        # header=False: don't write pandas' default column names to the output
        chunk.to_csv(output_file, mode="a", index=False, header=False)

And then I'm sorting the file like this:

sort --parallel=2 -t, -k1,1 -n combined.csv > combined_sorted.csv

However, the merging process is incredibly slow; it takes more than an hour to merge the files.

Note: I have only 16 GB of RAM, which is why I'm using the chunking option.

Is there a faster solution available?

Thanks

John
  • Did you try the solutions [here](https://stackoverflow.com/questions/56494140/how-do-i-combine-large-csv-files-in-python)? – Aladin Apr 22 '22 at 13:15
  • Are the two files you're joining already sorted? If so, you can join/sort them in O(n+m). – Daniel Price Apr 22 '22 at 13:18
  • @Aladin I think my current solution and the solution you linked are the same. I've been using my current code for a few months, so I must have copied the code from there. – John Apr 22 '22 at 13:18
  • @DanielPrice 99.99% of the data is sorted. So I need to sort the file to avoid errors. – John Apr 22 '22 at 13:20
  • If you are just concatenating files, why do you need pandas? Just concatenate them as text files. – eshirvana Apr 22 '22 at 13:20
  • @eshirvana I'm using the pandas library to avoid memory issues. Do you know any memory-efficient way to concatenate files without pandas? – John Apr 22 '22 at 13:22
  • You can read and write one line at a time, avoiding the memory issue (see the sketches after these comments). – Daniel Price Apr 22 '22 at 13:23
  • A database is better suited for your needs. You can ingest all these CSV files into a SQLite database, export the sorted version, then delete the database (a rough sketch follows these comments). Check the "Data Analysis" use case on SQLite's website: https://www.sqlite.org/whentouse.html – Code Different Apr 22 '22 at 13:27
  • At some point, in order to sort successfully, you will need the merged files in memory. If you're running Windows you may have to investigate increasing your swap size. On a *nix type system you shouldn't need to do that albeit that it may run slowly – DarkKnight Apr 22 '22 at 13:29
  • @LancelotduLac I'm on Ubuntu, and the sort command is optimised for memory by default: it automatically detects the available memory and sorts efficiently. So I think I'm okay with the sort command. My issue is mainly with merging files. – John Apr 22 '22 at 13:32
  • @John Create a temporary file and just read your input files a line at a time (skipping the header line on the second and subsequent files) and write those lines to the temp file. Then sort the temp file using the Ubuntu sort. – DarkKnight Apr 22 '22 at 13:46
  • @EdMorton No, my files don't have blank lines. They also don't have a header row; values are separated by commas. I have updated my question as you asked. Thanks – John Apr 23 '22 at 03:13
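
A minimal sketch of the line-at-a-time approach suggested in these comments. It assumes, as in the question, that sorted_fnames holds the input paths in ID order and that the files have no header rows:

# sorted_fnames: hypothetical list of input CSV paths
sorted_fnames = ["1.csv", "2.csv", "3.csv"]

with open("combined.csv", "w") as outfile:
    for fname in sorted_fnames:
        with open(fname) as infile:
            for line in infile:
                # one line in memory at a time, regardless of file size
                outfile.write(line)

For even less overhead, shutil.copyfileobj(infile, outfile) copies the same data in fixed-size blocks instead of line by line.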
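
And a rough sketch of the SQLite route from the comments (the table and column names here are my own choices, not from the original comment):

import csv
import sqlite3

sorted_fnames = ["1.csv", "2.csv", "3.csv"]  # hypothetical input list

conn = sqlite3.connect("staging.db")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, c1 INT, c2 INT, c3 INT)")
for fname in sorted_fnames:
    with open(fname, newline="") as f:
        # csv.reader streams rows; executemany inserts them as it reads
        conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", csv.reader(f))
conn.commit()

# Export in ID order; the PRIMARY KEY makes this an ordered index scan
with open("combined_sorted.csv", "w", newline="") as out:
    csv.writer(out).writerows(conn.execute("SELECT * FROM data ORDER BY id"))
conn.close()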

3 Answers


On second thought, you can use the HDF5 format, which handles big data really well:

import pandas as pd

hdf_path = '_combined.h5'

# sorted_fnames: the list of input CSV paths, as in the question
with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    for csv_file_name in sorted_fnames:
        # header=None: the files have no header row
        store.append('data', pd.read_csv(csv_file_name, header=None), index=False)

You can eventually save it back to CSV if you want, but working with HDF5 would be more efficient.
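
If the individual files are bigger than the available RAM, a chunked variant of the same idea keeps memory bounded (a sketch; the explicit column names are my own assumption, used so that every chunk matches the table schema):

import pandas as pd

hdf_path = '_combined.h5'
CHUNK_SIZE = 1_000_000  # rows per chunk; tune to your RAM

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    for csv_file_name in sorted_fnames:
        # header=None: the files have no header row
        for chunk in pd.read_csv(csv_file_name, header=None,
                                 names=['id', 'col1', 'col2', 'col3'],
                                 chunksize=CHUNK_SIZE):
            store.append('data', chunk, index=False)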

eshirvana
  • I'll give this a try. I have a couple of questions. (1) The csv_file_name size should be less than the available memory, right? (2) The variable output_file is not being used in your code. Is that intentional? – John Apr 22 '22 at 14:24
  • @John (1) Give it a try like this and see how it goes (I'd like to see how it performs). (2) No, I copied that from your code and forgot to remove it; it's not needed in that piece of code. – eshirvana Apr 22 '22 at 14:26
  • Okay. I already have a job running that utilises most of the memory; it might take a few more hours to complete. I'll do a test once it finishes and let you know how it goes. One more thing: can I use the sort command with the HDF5 format too? I mean like this: `sort --parallel=2 -t, -k1,1 -n combined.hdf5 > combined_sorted.hdf5` – John Apr 22 '22 at 14:40
  • Just tried your solution, but after a few minutes I got this error message: `ValueError: cannot match existing table structure for [false] on appending data` – John Apr 22 '22 at 16:53
  • @John Oh, so your CSV files don't have a consistent structure. – eshirvana Apr 22 '22 at 18:17
  • Looks like it. But I had no issue when processing via pandas. – John Apr 23 '22 at 00:32

Found a fast solution. The files are processed in a few minutes instead of hours.

The following assumes none of the CSV files has a header row. If they do, you need to remove the headers first; one way to do that is sketched after the code below. [Note: no fix is needed if only the first CSV file has a header row.]

import subprocess
sorted_fnames = ["1.csv",
                 "2.csv",
                 "3.csv"]

# Let the OS-level cat do the copying: no CSV parsing, no DataFrames
my_cmd = ['cat'] + sorted_fnames
with open('combined.csv', "w") as outfile:
    subprocess.run(my_cmd, stdout=outfile, check=True)
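
If the files did have header rows, a small variation (my own sketch, not part of the original answer) drops the first line of each file while streaming in pure Python:

with open('combined.csv', 'w') as outfile:
    for fname in sorted_fnames:
        with open(fname) as infile:
            next(infile, None)       # skip the header line, if any
            outfile.writelines(infile)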

If you want to sort, you can use the sort command:

sort --parallel=2 -t, -k1,1 -n combined.csv > combined_sorted.csv
John
  • Even if you had a header, I think processing the inputs as files rather than parsing whole CSVs is the right idea. The only caveat is that it won't work if the fields are not all in the same order in every file. – Casey Apr 22 '22 at 17:14
  • @John What you are doing here is what I meant in my original comment. – eshirvana Apr 22 '22 at 18:17
  • @eshirvana Yes, your comment triggered me to search for a solution like this. – John Apr 23 '22 at 00:30

If this isn't all you need:

$ cat file1 file2
1,23,22,27
2,67,29,22
3,34,34,23
4,23,22,27
5,67,29,22
6,34,34,23

then edit your question to provide more useful sample input/output that truly demonstrates your requirements and where the above doesn't work for it.

Ed Morton
  • I think this is the same solution I posted 12 hours ago, except you are not using Python. My project is in Python and the sorted_fnames are not static; hence I'm using a Python solution. Thanks – John Apr 23 '22 at 05:21
  • I see, I didn't read your solution as I just saw multiple lines of `python` when a simple `cat` seemed to be all that's needed. You tagged your question with `sed` meaning you're open to using Unix tools other than `python` and if you do need to use `python` for some reason then calling `cat` from within `python` instead of just letting `python` read the files can't be the idiomatic `python` way to do this. – Ed Morton Apr 23 '22 at 12:18
  • No issues. Thanks for the answer. – John Apr 23 '22 at 14:26