I have a large pandas DataFrame (> 10 GB) that I'd like to divide into smaller DataFrames and then sort each of them in parallel using multiprocessing.
A simplified version of my code is as follows:
from multiprocessing import Process

import numpy as np
import pandas as pd


def sort_df(df):
    df.sort_values(by=["b"], inplace=True)
    print(df)


if __name__ == "__main__":
    df = pd.DataFrame(np.random.rand(20, 3), columns=['a', 'b', 'c'])

    gb = df.groupby(pd.cut(df["b"], 4))
    # copy() is to suppress SettingWithCopyWarning
    partitioned_dfs = [gb.get_group(g).copy() for g in gb.groups]

    procs = []
    for df in partitioned_dfs:
        proc = Process(target=sort_df, args=(df,))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()
It seems that my code uses too much memory when the large DataFrame is the input:

- partitioned_dfs takes up as much memory as the input DataFrame itself (measured in the sketch after this list);
- when a new process is spawned, the child is created as a copy of the parent process, so it also carries copies of the input DataFrame and partitioned_dfs in its address space, even though it is responsible for only a subset of the data.
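For reference, the first point can be checked directly with DataFrame.memory_usage(deep=True). The minimal sketch below just repeats the toy setup from the snippet above and compares the totals (on the real > 10 GB DataFrame the partitions add up to roughly the same size as the input):

import numpy as np
import pandas as pd

# Same toy setup as in the snippet above; the real DataFrame is > 10 GB.
df = pd.DataFrame(np.random.rand(20, 3), columns=['a', 'b', 'c'])
gb = df.groupby(pd.cut(df["b"], 4))
partitioned_dfs = [gb.get_group(g).copy() for g in gb.groups]

# memory_usage(deep=True) reports the actual buffer sizes per column (plus index).
input_bytes = df.memory_usage(deep=True).sum()
partition_bytes = sum(part.memory_usage(deep=True).sum() for part in partitioned_dfs)

# Both the input DF and the partition copies live in the parent at the same time,
# so the footprint is roughly doubled before any worker process is started.
print(f"input DF:        {input_bytes} bytes")
print(f"partitioned_dfs: {partition_bytes} bytes")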
I am wondering how to reduce the memory usage of the code. Is multiprocessing an appropriate choice here?