
I have a large pandas DataFrame (> 10 GB) that I’d like to divide into smaller DFs and then sort each of them in parallel using multiprocessing.

A simplified version of my code is as follows:

from multiprocessing import Process

import numpy as np
import pandas as pd


def sort_df(df):
    # Each worker sorts its own partition in place; the sorted result
    # exists only in the child's address space and is not sent back.
    df.sort_values(by=["b"], inplace=True)
    print(df)


if __name__ == "__main__":
    df = pd.DataFrame(np.random.rand(20, 3), columns=['a', 'b', 'c'])
    # Partition the rows into 4 bins by the value of column "b" so that
    # each partition can be sorted independently.
    gb = df.groupby(pd.cut(df["b"], 4))
    # copy() is to suppress SettingWithCopyWarning
    partitioned_dfs = [gb.get_group(g).copy() for g in gb.groups]

    # One worker process per partition
    procs = []
    for df in partitioned_dfs:
        proc = Process(target=sort_df, args=(df,))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()

It seems that my code uses too much memory when given the large DF as input:

  1. partitioned_dfs takes up as much memory again as the input DF (see the quick check after this list);
  2. When a new process is spawned, the child is created as a copy of the parent process, so it also carries a copy of the input DF and partitioned_dfs in its address space, even though it is responsible for only a subset of the data.
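The first point is easy to confirm with DataFrame.memory_usage (a quick sanity-check sketch, meant to be run right after partitioned_dfs is built):

total_bytes = df.memory_usage(deep=True).sum()
partition_bytes = sum(p.memory_usage(deep=True).sum() for p in partitioned_dfs)
# The two figures come out roughly equal, i.e. partitioning has
# doubled the resident data.
print(f"input: {total_bytes} B, partitions: {partition_bytes} B")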

I am wondering how to reduce the memory usage of the code. Is multiprocessing an appropriate choice here?
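One direction I have been experimenting with (a minimal sketch, assuming the partitions may be spilled to disk; the pickle files under a temp directory are purely illustrative): write each partition out, drop the in-memory copies in the parent, and have every worker load, sort, and rewrite only its own file. Using the "spawn" start method also keeps children from inheriting the parent's address space in the first place.

import multiprocessing as mp
import os
import tempfile

import numpy as np
import pandas as pd


def sort_partition(path):
    # The worker loads only its own partition, so its footprint is
    # bounded by the size of one partition rather than the whole DF.
    part = pd.read_pickle(path)
    part.sort_values(by=["b"], inplace=True)
    part.to_pickle(path)


if __name__ == "__main__":
    # "spawn" starts each child from a fresh interpreter instead of a
    # fork of the parent, so children never inherit df/partitioned_dfs;
    # they receive only the path string passed as an argument.
    mp.set_start_method("spawn")

    df = pd.DataFrame(np.random.rand(20, 3), columns=['a', 'b', 'c'])
    gb = df.groupby(pd.cut(df["b"], 4))

    tmpdir = tempfile.mkdtemp()
    paths = []
    for i, g in enumerate(gb.groups):
        path = os.path.join(tmpdir, f"part_{i}.pkl")
        gb.get_group(g).to_pickle(path)
        paths.append(path)
    del df, gb  # free the parent's copies before starting workers

    procs = [mp.Process(target=sort_partition, args=(p,)) for p in paths]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()

The trade-off is an extra round trip through disk per partition, which may well cost more than it saves; whether any of this beats a single in-memory sort is exactly what the comments below question.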

  • I highly doubt this will result in any faster execution. You need to divide your data, sort it independently, wait till the slowest sort is done, and join them back again. The copy also moves data around ... are you sure that's faster than simply trusting pandas to do it as fast as possible? – Patrick Artner May 18 '20 at 08:35
  • Possible duplicate: [sorting-in-pandas-for-large-datasets](https://stackoverflow.com/questions/21271727/sorting-in-pandas-for-large-datasets) – Patrick Artner May 18 '20 at 08:36
  • @PatrickArtner, I'm not sure; I'm simply experimenting to see if it could be any faster. In my case it is acceptable not to join the divided sorted data. – s9527 May 18 '20 at 08:43
  • Maybe add a guesstimate of how big your data is - applicable strategies might need to know that. I am only dabbling in pd with small datasets and have never had memory problems. – Patrick Artner May 18 '20 at 08:48

0 Answers