I have a large file, ~5 TB (> 2000 columns and 250 million rows), and want to join it with another, much smaller file, ~10 GB (10 columns and 20 million rows).
This is my current approach in Python:
- Load the smaller file into memory and set its index.
- Split the larger file into 250 parts so that each part fits into memory.
- Load part k of the larger file (k = 1, ..., 250) into memory, set its index, and join it with the smaller file on the id field (which is the index).
- Save the results to disk, then load the next part...
Code:
import pandas as pd

# column_types_small_file is a {column_index: dtype} dict defined elsewhere
small_file = pd.read_csv('small_file.txt', sep='|', header=None,
                         usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                         dtype=column_types_small_file)
small_file.set_index(2, inplace=True)  # column 2 is the id field

large_file_part_k = pd.read_csv('large_file_part_k.txt', sep='|', header=None)
large_file_part_k.set_index(2, inplace=True)
merged = pd.merge(small_file, large_file_part_k, how='inner',
                  left_index=True, right_index=True)
merged.to_csv('join_results.csv', mode='a', header=False)  # append each part's result
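For completeness, here is a minimal sketch of the outer loop implied by the steps above. It assumes the parts are named large_file_part_1.txt through large_file_part_250.txt (the naming is illustrative) and reuses the indexed small_file frame from the snippet above:

import pandas as pd

# small_file is the indexed frame built in the snippet above
for k in range(1, 251):
    # Load part k; columns get integer labels because header=None
    part = pd.read_csv(f'large_file_part_{k}.txt', sep='|', header=None)
    part.set_index(2, inplace=True)
    # Inner join on the shared index (the id column at position 2)
    merged = pd.merge(small_file, part, how='inner',
                      left_index=True, right_index=True)
    # Append this part's result to the same output file
    merged.to_csv('join_results.csv', mode='a', header=False)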
This works :) but processing one part takes ~550 seconds, and ~450 of those seconds are spent just loading the part into memory. :( I also tried defining data types for all columns when loading, but that made loading about 3-4 times slower.
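For reference, the dtype attempt looks roughly like this; the column types below are illustrative guesses, not the ones actually used:

# Illustrative only: the real dtype map for the ~2000 columns is not shown here
column_types_large_file = {i: 'float32' for i in range(2000)}
column_types_large_file[2] = 'int64'  # the id/join column

large_file_part_k = pd.read_csv('large_file_part_k.txt', sep='|', header=None,
                                dtype=column_types_large_file)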
Do you have any other options for this particular situation? The machine I use has 120 GB of memory and 32 cores.