
How can I share a huge DataFrame between many processes without duplicating it for every process that gets created?

Take a look at this code:

from functools import partial
from multiprocessing import Pool
from pandas import DataFrame


def work(task, df):
    print(f'Working on task {task}, DataFrame located at {hex(id(df))}')


def main():
    huge_df = DataFrame(...)  # placeholder for the real, very large DataFrame
    tasks = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}

    with Pool(processes=min(4, len(tasks))) as pool:
        # partial() binds huge_df, so it is pickled and shipped to the workers along with the tasks
        pool.map(partial(work, df=huge_df), tasks.items())


if __name__ == '__main__':
    main()

To be clear, `huge_df` is only needed for read-only operations.

Is there a better way to do this?

  • You can take a look at the [Dask library](https://dask.org/) – Hugolmn Jul 07 '20 at 15:12
  • If you're on Linux or macOS, that dataframe will actually be shared between the subprocesses due to copy-on-write memory (see the fork-based sketch after these comments). – AKX Jul 07 '20 at 15:21
  • [This answer](https://stackoverflow.com/a/14268804/5666087) might also be useful – jkr Jul 07 '20 at 15:25
  • @AKX: are you sure? Copy-on-write after fork is by page, and you have little control, if any, over Python and pandas memory management. I would not rely on copy-on-write alone to use multiprocessing over a large Pandas dataframe. – Serge Ballesta Jul 07 '20 at 15:34
  • @SergeBallesta Not bulletproof sure, but if OP really only does read-only operations, only the object refcount values for each Python object should ever change, and for a large dataframe (let's say hundreds of megs or maybe up to gigs), that's a negligible overhead. – AKX Jul 07 '20 at 15:44
  • @AKX this is not at all how Python shares data between processes: it _pickles_ it, meaning you have to transfer it from the main process to the subprocesses. – acushner Jul 07 '20 at 16:16
  • @acushner Yes, if you send it down the pipe to a subprocess. If it's a global variable (or otherwise accessible) at the time the subprocess gets forked from the parent, it's available to the children. – AKX Jul 07 '20 at 16:17
  • Granted, I should have mentioned that in my earlier comment :) – AKX Jul 07 '20 at 16:18
  • If the df only has one data type (e.g. float) and your work can be done using numpy and not pandas, you can share the data by using `multiprocessing.Array` with `df.values` and use `np.frombuffer` in your `work` function to reinterpret the shared array as an `np.ndarray` (see the shared-array sketch after these comments). – Niklas Mertsch Jul 07 '20 at 16:30
  • @Hugolmn I'm already using Dask, but that doesn't solve this problem, because `dask.DataFrame.compute()` converts back to a `pandas.DataFrame`. – Hazan Jul 07 '20 at 16:31
  • @AKX I'm using Windows and `huge_df` is only used for read-only operations. How can I pass it to the `work` function so that every process reads the same object in memory instead of getting its own copy? – Hazan Jul 07 '20 at 16:35
  • Adding to my comment: Not sure if wrapping an array in a DataFrame would copy the data. If not, you can wrap it in a DataFrame and have the memory shared between the processes. – Niklas Mertsch Jul 07 '20 at 16:37
  • @NiklasMertsch There are mixed data types (dates & floats) in the DataFrame. – Hazan Jul 07 '20 at 16:40
  • I'm still looking for help... – Hazan Jul 08 '20 at 13:19
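
A minimal sketch of the fork-based copy-on-write approach AKX describes in the comments: if the DataFrame already exists in the parent when the pool is forked, the children can read it without any pickling. This only applies on Linux/macOS (Windows has no fork), and the module-level `HUGE_DF` name and the small stand-in DataFrame are illustrative, not part of the original code.

from multiprocessing import get_context
from pandas import DataFrame

HUGE_DF = None  # filled in by the parent before the pool is forked


def work(task):
    # reads the DataFrame inherited from the parent; read-only access
    # stays shared via copy-on-write pages
    key, values = task
    print(f'Task {key}: {len(HUGE_DF)} rows visible, values={values}')


def main():
    global HUGE_DF
    HUGE_DF = DataFrame({'x': range(1_000_000)})  # stand-in for the real data
    tasks = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}

    # 'fork' is unavailable on Windows; since Python 3.8 macOS defaults to 'spawn',
    # so the start method is requested explicitly here
    ctx = get_context('fork')
    with ctx.Pool(processes=min(4, len(tasks))) as pool:
        pool.map(work, tasks.items())


if __name__ == '__main__':
    main()

As Serge Ballesta points out above, copy-on-write works per page and the interpreter can still dirty pages (reference counts, for example), so treat this as an optimisation to verify with your own data rather than a guarantee.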

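And a sketch of the shared-array route Niklas Mertsch suggests, which also works on Windows because the shared block is handed to the workers through the pool initializer rather than pickled with each task. It assumes the data can be reduced to a single float dtype (which, per the comments, is not quite the OP's situation); the `init_worker` initializer and the `SHARED` module-level dict are illustrative names.

import numpy as np
from multiprocessing import Pool, Array

SHARED = {}  # per-worker slot for the shared buffer, filled by init_worker


def init_worker(shared_arr, shape):
    SHARED['arr'] = shared_arr
    SHARED['shape'] = shape


def work(task):
    # reinterpret the shared buffer as an ndarray; no data is copied
    values = np.frombuffer(SHARED['arr'], dtype=np.float64).reshape(SHARED['shape'])
    key, _ = task
    print(f'Task {key}: column means = {values.mean(axis=0)}')


def main():
    data = np.random.rand(1_000_000, 3)             # stand-in for df.values
    shared_arr = Array('d', data.size, lock=False)  # shared memory; no lock needed for read-only use
    np.frombuffer(shared_arr, dtype=np.float64)[:] = data.ravel()

    tasks = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
    with Pool(processes=min(4, len(tasks)),
              initializer=init_worker,
              initargs=(shared_arr, data.shape)) as pool:
        pool.map(work, tasks.items())


if __name__ == '__main__':
    main()

Whether wrapping that array back into a DataFrame inside `work` copies the data depends on the pandas version, as Niklas notes, so it is worth checking with `np.shares_memory` before relying on it.
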
0 Answers