
I have a large file of ~5 TB (> 2000 columns and 250 million rows) and want to join it with another file that is pretty small, ~10 GB (10 columns and 20 million rows).

This is my current approach in Python:

  1. Load the smaller file into memory and set its index.
  2. Split the larger file into 250 parts so that each part fits into memory.
  3. Load part k of the larger file (k = 1, ..., 250) into memory, set its index, and join it with the smaller file on the id field (which is the index).
  4. Save the results to disk and load the next part...

Code:

import pandas as pd

# Small file: keep the 10 columns and use the join key (column 2) as the index.
small_file = pd.read_csv('small_file.txt', sep='|', header=None,
                         usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                         dtype=column_types_small_file)
small_file.set_index(2, inplace=True)

# Part k of the large file, indexed on the same key and joined.
large_file_part_k = pd.read_csv('large_file_part_k.txt', sep='|', header=None)
large_file_part_k.set_index(2, inplace=True)
merged = pd.merge(small_file, large_file_part_k, how="inner",
                  left_index=True, right_index=True)
merged.to_csv('join_results.csv', mode='a', header=False)

This works :) But processing one part takes ~550 sec. Why? Because loading one part into memory alone takes ~450 sec. :( I tried defining data types for all the columns when loading, but that made matters worse, roughly 3-4 times slower.
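If most of that 450 sec goes into CSV parsing and disk I/O, one variant worth sketching is to let pandas stream the large file itself with chunksize instead of pre-splitting it on disk. This is only a rough outline; the single 'large_file.txt' path and the chunk size are placeholders, not the actual setup:

import pandas as pd

# The small file and its index, as in the code above.
small_file = pd.read_csv('small_file.txt', sep='|', header=None,
                         usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
small_file.set_index(2, inplace=True)

# chunksize turns read_csv into an iterator, so each chunk is parsed,
# joined and appended to disk before the next one is loaded.
reader = pd.read_csv('large_file.txt', sep='|', header=None, chunksize=1_000_000)
for chunk in reader:
    chunk.set_index(2, inplace=True)
    merged = pd.merge(small_file, chunk, how='inner',
                      left_index=True, right_index=True)
    merged.to_csv('join_results.csv', mode='a', header=False)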

Do you have any other options for this particular situation? The machine I use has 120 GB of memory and 32 cores.

max04
  • you may want to consider [Dask](http://dask.pydata.org/en/latest/dataframe.html) – DJK Sep 10 '18 at 12:59
  • you might want to refer to [this](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas?rq=1) – anky Sep 10 '18 at 13:15
  • @DJK tried Dask, but any action on this large file takes a lot of time. I assume 5 TB is too much for Dask. – max04 Sep 10 '18 at 13:54
  • @max04, I would think 5 TB will be a challenge in terms of time for any data manipulation task. Python itself may be a bad choice. Sometimes it's great that code can run while you sleep – DJK Sep 10 '18 at 15:02
  • @DJK yeah :) But when I wrote 'take a lot of time' I meant a few minutes or more even for a head command using Dask on the whole large file. – max04 Sep 10 '18 at 15:15
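For completeness, here is a rough sketch of the Dask approach suggested in the comments above. The paths, blocksize, and output pattern are assumptions, not something tested on data of this size:

import dask.dataframe as dd
import pandas as pd

# The small file fits comfortably in memory, so read it with plain pandas.
small = pd.read_csv('small_file.txt', sep='|', header=None,
                    usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Dask splits the large CSV into ~256 MB partitions and parses them in parallel.
large = dd.read_csv('large_file.txt', sep='|', header=None, blocksize='256MB')

# Merging a Dask DataFrame with an in-memory pandas DataFrame broadcasts the
# small frame to every partition, so the 5 TB side is never shuffled.
merged = dd.merge(large, small, on=2, how='inner')

# Writes one CSV per partition; '*' is replaced by the partition number.
merged.to_csv('join_results-*.csv', header=False, index=False)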

1 Answer


I would consider converting your files to a format more efficient than CSV, for example HDF5 or Feather, which would give you a boost in read/write operations. See also this (closed) Stack Overflow question.
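As a rough illustration of the conversion idea (the file names are placeholders, and each part still has to fit in memory once for the one-off conversion):

import pandas as pd

# One-off conversion of one CSV part to Feather (requires pyarrow).
part = pd.read_csv('large_file_part_k.txt', sep='|', header=None)
part.columns = [str(c) for c in part.columns]  # Feather requires string column names
part.to_feather('large_file_part_k.feather')

# Later reads skip CSV parsing entirely and are typically much faster.
part = pd.read_feather('large_file_part_k.feather')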

robertspierre
  • Thank you! I tried it, but the conversion process was really long so I stopped. I think I should try once more :) – max04 Sep 10 '18 at 12:45