
I have a large file of ~5 TB (> 2000 columns and 250 million rows) and want to join it with another file that is pretty small, ~10 GB (10 columns and 20 million rows).

This is my current approach in Python:

  1. Load the smaller file into memory and set its index.
  2. Split the larger file into 250 parts so that each part fits into memory.
  3. Load part k of the larger file (k = 1, ..., 250) into memory, set its index, and join it with the smaller file on the id field (which is the index).
  4. Save the results to disk and load the next part...

Code:

import pandas as pd

# Small file: keep the 10 columns and use the join key (column 2) as the index.
small_file = pd.read_csv('small_file.txt', sep='|', header=None,
                         usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                         dtype=column_types_small_file)
small_file.set_index(2, inplace=True)

# Part k of the large file, indexed on the same key and joined.
large_file_part_k = pd.read_csv('large_file_part_k.txt', sep='|', header=None)
large_file_part_k.set_index(2, inplace=True)
merged = pd.merge(small_file, large_file_part_k, how="inner",
                  left_index=True, right_index=True)
merged.to_csv('join_results.csv', mode='a', header=False)

This works :) But processing one part takes ~550 sec. Why? Because loading one part into memory alone takes ~450 sec. :( I tried defining data types for all the columns when loading, but that made matters worse, roughly 3-4 times slower.
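If most of that 450 sec goes into CSV parsing and disk I/O, one variant worth sketching is to let pandas stream the large file itself with chunksize instead of pre-splitting it on disk. This is only a rough outline; the single 'large_file.txt' path and the chunk size are placeholders, not the actual setup:

import pandas as pd

# The small file and its index, as in the code above.
small_file = pd.read_csv('small_file.txt', sep='|', header=None,
                         usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
small_file.set_index(2, inplace=True)

# chunksize turns read_csv into an iterator, so each chunk is parsed,
# joined and appended to disk before the next one is loaded.
reader = pd.read_csv('large_file.txt', sep='|', header=None, chunksize=1_000_000)
for chunk in reader:
    chunk.set_index(2, inplace=True)
    merged = pd.merge(small_file, chunk, how='inner',
                      left_index=True, right_index=True)
    merged.to_csv('join_results.csv', mode='a', header=False)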

Do you have any other options for this particular situation? The machine I use has 120 GB of memory and 32 cores.

max04
  • you may want to consider [Dask](http://dask.pydata.org/en/latest/dataframe.html) – DJK Sep 10 '18 at 12:59
  • you might want to refer to [this](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas?rq=1) – anky Sep 10 '18 at 13:15
  • @DJK tried Dask, but any action on this large file takes a lot of time. I assume 5 TB is too much for Dask. – max04 Sep 10 '18 at 13:54
  • @max04, I would think 5 TB will be a challenge in terms of time for any data manipulation task. Python itself may be a bad choice. Sometimes it's great that code can run while you sleep – DJK Sep 10 '18 at 15:02
  • @DJK yeah :) But when I wrote 'take a lot of time' I meant a few minutes or more even for a head command using Dask on the whole large file. – max04 Sep 10 '18 at 15:15
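For completeness, here is a rough sketch of the Dask approach suggested in the comments above. The paths, blocksize, and output pattern are assumptions, not something tested on data of this size:

import dask.dataframe as dd
import pandas as pd

# The small file fits comfortably in memory, so read it with plain pandas.
small = pd.read_csv('small_file.txt', sep='|', header=None,
                    usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Dask splits the large CSV into ~256 MB partitions and parses them in parallel.
large = dd.read_csv('large_file.txt', sep='|', header=None, blocksize='256MB')

# Merging a Dask DataFrame with an in-memory pandas DataFrame broadcasts the
# small frame to every partition, so the 5 TB side is never shuffled.
merged = dd.merge(large, small, on=2, how='inner')

# Writes one CSV per partition; '*' is replaced by the partition number.
merged.to_csv('join_results-*.csv', header=False, index=False)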

1 Answer


I would consider converting your files to a format more efficient than CSV, for example HDF5 or Feather, which would give you a boost in read/write operations. See also this (closed) Stack Overflow question.
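As a rough illustration of the conversion idea (the file names are placeholders, and each part still has to fit in memory once for the one-off conversion):

import pandas as pd

# One-off conversion of one CSV part to Feather (requires pyarrow).
part = pd.read_csv('large_file_part_k.txt', sep='|', header=None)
part.columns = [str(c) for c in part.columns]  # Feather requires string column names
part.to_feather('large_file_part_k.feather')

# Later reads skip CSV parsing entirely and are typically much faster.
part = pd.read_feather('large_file_part_k.feather')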

robertspierre
  • Thank you! I tried it, but the conversion process was really long so I stopped. I think I should try once more :) – max04 Sep 10 '18 at 12:45