
How can I share a huge DataFrame between many processes without duplicating it for every process that gets created?

Take a look at this code:

from functools import partial
from multiprocessing import Pool
from pandas import DataFrame


def work(task, df):
    print(f'Working on task {task}, DataFrame located at {hex(id(df))}')


def main():
    huge_df = DataFrame(...)  # placeholder for the real, very large DataFrame
    tasks = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}

    with Pool(processes=min(4, len(tasks))) as pool:
        # partial() binds huge_df, so it is pickled and shipped to the workers along with the tasks
        pool.map(partial(work, df=huge_df), tasks.items())


if __name__ == '__main__':
    main()

To be clear, `huge_df` is only needed for read-only operations.

Is there a better way to do this?

  • You can take a look at the [Dask library](https://dask.org/) – Hugolmn Jul 07 '20 at 15:12
  • If you're on Linux or macOS, that dataframe will actually be shared between the subprocesses due to copy-on-write memory (see the fork-based sketch after these comments). – AKX Jul 07 '20 at 15:21
  • [This answer](https://stackoverflow.com/a/14268804/5666087) might also be useful – jkr Jul 07 '20 at 15:25
  • @AKX: are you sure? Copy-on-write after fork is by page, and you have little control, if any, over Python and pandas memory management. I would not rely on copy-on-write alone to use multiprocessing over a large Pandas dataframe. – Serge Ballesta Jul 07 '20 at 15:34
  • @SergeBallesta Not bulletproof sure, but if OP really only does read-only operations, only the object refcount values for each Python object should ever change, and for a large dataframe (let's say hundreds of megs or maybe up to gigs), that's a negligible overhead. – AKX Jul 07 '20 at 15:44
  • @AKX this is not at all how Python shares data between processes: it _pickles_ it, meaning you have to transfer it from the main process to the subprocesses. – acushner Jul 07 '20 at 16:16
  • @acushner Yes, if you send it down the pipe to a subprocess. If it's a global variable (or otherwise accessible) at the time the subprocess gets forked from the parent, it's available to the children. – AKX Jul 07 '20 at 16:17
  • Granted, I should have mentioned that in my earlier comment :) – AKX Jul 07 '20 at 16:18
  • If the df only has one data type (e.g. float) and your work can be done using numpy and not pandas, you can share the data by using `multiprocessing.Array` with `df.values` and use `np.frombuffer` in your `work` function to reinterpret the shared array as an `np.ndarray` (see the shared-array sketch after these comments). – Niklas Mertsch Jul 07 '20 at 16:30
  • @Hugolmn I'm already using Dask, but that doesn't solve this problem, because `dask.DataFrame.compute()` converts back to a `pandas.DataFrame`. – Hazan Jul 07 '20 at 16:31
  • @AKX I'm using Windows and `huge_df` is only used for read-only operations. How can I pass it to the `work` function so that every process reads the same object in memory instead of getting its own copy? – Hazan Jul 07 '20 at 16:35
  • Adding to my comment: Not sure if wrapping an array in a DataFrame would copy the data. If not, you can wrap it in a DataFrame and have the memory shared between the processes. – Niklas Mertsch Jul 07 '20 at 16:37
  • @NiklasMertsch There are mixed data types (dates & floats) in the DataFrame. – Hazan Jul 07 '20 at 16:40
  • I'm still looking for help... – Hazan Jul 08 '20 at 13:19
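
A minimal sketch of the fork-based copy-on-write approach AKX describes in the comments: if the DataFrame already exists in the parent when the pool is forked, the children can read it without any pickling. This only applies on Linux/macOS (Windows has no fork), and the module-level `HUGE_DF` name and the small stand-in DataFrame are illustrative, not part of the original code.

from multiprocessing import get_context
from pandas import DataFrame

HUGE_DF = None  # filled in by the parent before the pool is forked


def work(task):
    # reads the DataFrame inherited from the parent; read-only access
    # stays shared via copy-on-write pages
    key, values = task
    print(f'Task {key}: {len(HUGE_DF)} rows visible, values={values}')


def main():
    global HUGE_DF
    HUGE_DF = DataFrame({'x': range(1_000_000)})  # stand-in for the real data
    tasks = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}

    # 'fork' is unavailable on Windows; since Python 3.8 macOS defaults to 'spawn',
    # so the start method is requested explicitly here
    ctx = get_context('fork')
    with ctx.Pool(processes=min(4, len(tasks))) as pool:
        pool.map(work, tasks.items())


if __name__ == '__main__':
    main()

As Serge Ballesta points out above, copy-on-write works per page and the interpreter can still dirty pages (reference counts, for example), so treat this as an optimisation to verify with your own data rather than a guarantee.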

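And a sketch of the shared-array route Niklas Mertsch suggests, which also works on Windows because the shared block is handed to the workers through the pool initializer rather than pickled with each task. It assumes the data can be reduced to a single float dtype (which, per the comments, is not quite the OP's situation); the `init_worker` initializer and the `SHARED` module-level dict are illustrative names.

import numpy as np
from multiprocessing import Pool, Array

SHARED = {}  # per-worker slot for the shared buffer, filled by init_worker


def init_worker(shared_arr, shape):
    SHARED['arr'] = shared_arr
    SHARED['shape'] = shape


def work(task):
    # reinterpret the shared buffer as an ndarray; no data is copied
    values = np.frombuffer(SHARED['arr'], dtype=np.float64).reshape(SHARED['shape'])
    key, _ = task
    print(f'Task {key}: column means = {values.mean(axis=0)}')


def main():
    data = np.random.rand(1_000_000, 3)             # stand-in for df.values
    shared_arr = Array('d', data.size, lock=False)  # shared memory; no lock needed for read-only use
    np.frombuffer(shared_arr, dtype=np.float64)[:] = data.ravel()

    tasks = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
    with Pool(processes=min(4, len(tasks)),
              initializer=init_worker,
              initargs=(shared_arr, data.shape)) as pool:
        pool.map(work, tasks.items())


if __name__ == '__main__':
    main()

Whether wrapping that array back into a DataFrame inside `work` copies the data depends on the pandas version, as Niklas notes, so it is worth checking with `np.shares_memory` before relying on it.
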
0 Answers