
I have a script that takes all the CSV files in a directory and merges them side-by-side using an outer join. The problem is that my computer chokes (MemoryError) when I try to use it on the files I need to join (about two dozen files, 6-12 GB each). I am aware that itertools can be used to make loops more efficient, but I am unclear as to whether or how it could be applied to this situation. The other alternative I can think of is to install MySQL, learn the basics, and do the join there. Obviously I'd rather do this in Python if possible, because I'm already learning it. An R-based solution would also be acceptable.

Here is my code:

import os
import glob
import pandas as pd
os.chdir("\\path\\containing\\files")

files = glob.glob("*.csv")
sdf = pd.read_csv(files[0], sep=',')  # start from the first file

# outer-join each remaining file onto the accumulated result
for filename in files[1:]:
    df = pd.read_csv(filename, sep=',')
    sdf = pd.merge(sdf, df, how='outer', on=['Factor1', 'Factor2'])

Any advice for how to do this with files too big for my computer's memory would be greatly appreciated.

Stonecraft

3 Answers


Use HDF5; in my opinion it would suit your needs very well. It also handles out-of-core queries, so you won't have to face a MemoryError.

import os
import glob
import pandas as pd
os.chdir("\\path\\containing\\files")

files = glob.glob("*.csv")
hdf_path = 'my_concatenated_file.h5'

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    # This compresses the final file at level 5 using blosc. You can drop the
    # compression or change it as per your needs.
    for filename in files:
        # Store the join keys as data columns so they can be indexed and queried
        store.append('table_name', pd.read_csv(filename, sep=','),
                     data_columns=['Factor1', 'Factor2'], index=False)
    # Then create the indexes, if you need them
    store.create_table_index('table_name', columns=['Factor1', 'Factor2'],
                             optlevel=9, kind='full')
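
One caveat: pd.read_csv(filename) still loads each whole 6-12 GB file into RAM before appending it. Below is a minimal sketch of the same loop using read_csv's chunksize so that only one chunk is in memory at a time; the chunk size of 500000 rows is an arbitrary assumption, so tune it to your machine.

import os
import glob
import pandas as pd
os.chdir("\\path\\containing\\files")

files = glob.glob("*.csv")
hdf_path = 'my_concatenated_file.h5'

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    for filename in files:
        # Stream each CSV in fixed-size chunks instead of reading it whole
        for chunk in pd.read_csv(filename, sep=',', chunksize=500000):
            # For string columns you may need min_itemsize if later chunks
            # contain longer values than the first one.
            store.append('table_name', chunk,
                         data_columns=['Factor1', 'Factor2'], index=False)
    store.create_table_index('table_name', columns=['Factor1', 'Factor2'],
                             optlevel=9, kind='full')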
Kartik
  • that threw a series of errors, the first of which was: `line 2885, in run_code exec(code_obj, self.user_global_ns, self.user_ns)` – Stonecraft Aug 06 '16 at 01:45
  • Going csv -> hdf5 -> csv on the file sizes you're dealing with will take some time, yeah? I'd try to go csv -> csv if I could. – Shawn K Aug 06 '16 at 01:49
  • Not true. HDF5 is at least 10 times faster than csv in my tests. Further, the OP does not mention that the output should be in csv. Any other implementation will need a fairly elaborate algorithm involving loops, because each file the OP has to deal with is 6-12 GB. So unless the OP is on a supercomputer, the chances of having that much RAM are slim. – Kartik Aug 06 '16 at 01:59
  • @JamesHanks, that traceback does not help at all. You might need different tables for different files if they don't share the same column structure. My code appends all files to a single table. HDF5 is not column oriented, therefore you would need to get clever with it. Using the same indexes across different tables will let you query what you need. I am making some edits to my answer, in the hope that they will help. You might also want to look at: http://pandas.pydata.org/pandas-docs/stable/io.html#query-via-data-columns – Kartik Aug 06 '16 at 02:07
  • Ok thanks. Eventually the result will need to be in csv, but there is nothing to stop me from saving as such afterwards, right? And the files I need to put together DO share the same columns; I was getting that error because of an irrelevant file that was in the same directory. So it looks like your solution will work! – Stonecraft Aug 06 '16 at 02:26
  • Nope, nothing should stop you from going from HDF to csv. HDF just lets you do the querying out-of-core so you can join quite easily. For instance, you can read the main table in chunks, extract the values of 'Factor1' and 'Factor2' and get only the rows which contain those values from all other tables, merge them and write them to a csv file (see the sketch after these comments). You will note that HDF5 is much faster and more manageable than csv. So, unless you have a compelling need to go back to csv, I think you are better off staying in HDF5. And HDF5 will soon have an ODBC driver: https://hdfgroup.org/wp/tag/hdf5-odbc-driver/ – Kartik Aug 06 '16 at 02:39
  • Since your files do share the same column structure, I suggest you also take a look at these answers: http://stackoverflow.com/questions/15798209/pandas-group-by-query-on-large-data-in-hdfstore and http://stackoverflow.com/questions/25459982/trouble-with-grouby-on-millions-of-keys-on-a-chunked-file-in-python-pandas/25471765#25471765 They will help you query your final table such that you only end up with rows sharing the same elements in 'Factor1' and 'Factor2', which you can easily reshape to get a side-by-side table for your final csv output. Also, use the previous version of my answer. – Kartik Aug 06 '16 at 02:44
  • Thanks for the additional info. I tried your code with my actual data and it was a lot faster (as in, it finished without error faster than the initial attempt crashed). I'm still figuring out how to work with h5 in R (my destination for the joined data), but there seems to be a lot of help available for that, so I should be good from here. Marked as answered :) – Stonecraft Aug 06 '16 at 03:44
  • You might have issues with h5 in R. I tried that once, then abandoned it. I saw only two solutions: go back to csv (or SQL) or switch completely to Python. With rpy2, SciPy and statsmodels, I managed to stay with Python and use the awesome features of pandas and HDF. But that was some time ago; the battlefield might have changed since then. Good luck! And thanks for accepting my answer. – Kartik Aug 06 '16 at 03:50
  • Yup, I am having issues with it right now (they seem to be memory related). I guess now I have to choose between staying in Python for the entire analysis and taking it as my cue to dabble in something else (MySQL). Thanks again! – Stonecraft Aug 06 '16 at 03:54
  • I'd stick with Python, mainly because it is good with huge datasets, has reasonably powerful analysis libraries (SciPy, statsmodels), h5 is faster than SQL (trust me, reading and writing my current project data in h5 is 1 hour faster than PostgreSQL for every 100k rows), and rpy2 fills the remaining holes (which are very few and far between). And h5 will soon get ODBC drivers, so one can treat it like any random SQL DB. – Kartik Aug 06 '16 at 04:01
  • If your problems are memory related, it could be because your computer does not have enough memory to load all the data at the same time. The free memory should be at least 1.5 times the size of the data you plan to load at a time for simple calculations like sum and mean, and more for complex calculations. That much is obvious. So use chunking. If you can, ask another question with your memory issues and we will try to find a solution. – Kartik Aug 06 '16 at 04:41
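
For reference, here is a minimal sketch of the out-of-core querying and chunked CSV export discussed in the comments above, assuming the single 'table_name' table built by the answer with 'Factor1' and 'Factor2' stored as data columns. The key values 'A' and 'B', the chunk size, and the output filename are illustrative placeholders.

import pandas as pd

hdf_path = 'my_concatenated_file.h5'   # file produced by the answer above
out_csv = 'joined_output.csv'          # hypothetical output path

with pd.HDFStore(hdf_path, mode='r') as store:
    # Out-of-core query: only rows matching the condition are read from disk,
    # because 'Factor1' and 'Factor2' were stored as data columns.
    subset = store.select('table_name', where="Factor1 == 'A' & Factor2 == 'B'")

    # Stream the whole table back out to CSV in chunks, never holding it all in RAM
    header_written = False
    for chunk in store.select('table_name', chunksize=500000):
        chunk.to_csv(out_csv, mode='a', header=not header_written, index=False)
        header_written = True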

There is a chance dask will be well-suited to your use. It might depend on what you want to do after the merge.
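
As an illustration (not tested on data of this size), here is a minimal sketch of the same outer join with dask.dataframe; the column names come from the question, and the output pattern 'merged-*.csv' is a placeholder.

import glob
import dask.dataframe as dd

# Lazily open every CSV; nothing is loaded until the result is written out
parts = [dd.read_csv(path) for path in sorted(glob.glob("*.csv"))]

# Outer-join the frames pairwise on the shared key columns
result = parts[0]
for part in parts[1:]:
    result = dd.merge(result, part, how='outer', on=['Factor1', 'Factor2'])

# Written one file per partition, so the full result never sits in memory at once
result.to_csv('merged-*.csv', index=False)

The repeated merges trigger shuffles that dask may spill to disk, so whether this is practical depends on available disk space and the scheduler you use.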

Mike Graham

You should be able to do this with Python, but I don't think reading the CSVs all at once will be the most efficient use of your memory.

How to read a CSV file from a stream and process each line as it is written?
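
In pandas terms, the same idea can be sketched with read_csv's chunksize, so only a slice of one file is in memory at a time; 'big_file.csv', the chunk size, and the filtering step are placeholders for whatever per-chunk work the alignment needs.

import pandas as pd

out_path = 'processed_piece.csv'   # hypothetical output file
header_written = False

# Only `chunksize` rows of the large file are held in memory at any one time
for chunk in pd.read_csv('big_file.csv', sep=',', chunksize=100000):
    # Example per-chunk step: drop rows with missing join keys, then flush
    # the finished rows to disk before reading the next chunk
    keep = chunk.dropna(subset=['Factor1', 'Factor2'])
    keep.to_csv(out_path, mode='a', header=not header_written, index=False)
    header_written = True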

Shawn K
  • I'm not sure if I am understanding how the stream works, but I think it might be a problem because I'm not simply concatenating lines together; rather, the presence or absence of the same key in different files changes how the lines will be aligned. – Stonecraft Aug 06 '16 at 01:48
  • You could do whatever you want as long as it fits in memory. Only read in the bits you need, as you need them, and flush what is complete/matched to disk. – Shawn K Aug 06 '16 at 01:54