
This question makes the same point as the link I posted earlier:

(Is there a good way to avoid memory deep copy or to reduce time spent in multiprocessing?)

I'm getting nowhere with it because I ran into the problem of sharing a `DataFrame` object between processes.

I simplified the sample code.

If anyone can revise my code to share a `DataFrame` object between processes without `Manager.list`, `Manager.dict`, or numpy sharedmem, I would greatly appreciate it.

Here's the code.

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
from multiprocessing import Process
import multiprocessing.sharedctypes as sharedctypes
import ctypes

def add_new_derived_column(shared_df_obj):
    shared_df_obj.value['new_column'] = shared_df_obj.value['A'] + shared_df_obj.value['B'] / 2
    print(shared_df_obj.value.head())
    '''
    "new_column" Generated!!!

          A         B  new_column
0 -0.545815 -0.179209   -0.635419
1  0.654273 -2.015285   -0.353370
2  0.865932 -0.943028    0.394418
3 -0.850136  0.464778   -0.617747
4 -1.077967 -1.127802   -1.641868
    '''

if __name__ == "__main__":

    dataframe = pd.DataFrame(np.random.randn(100000, 2), columns=['A', 'B'])

    # to share the DataFrame object, I use sharedctypes.RawValue
    shared_df_obj = sharedctypes.RawValue(ctypes.py_object, dataframe)

    # then I pass "shared_df_obj" to the multiprocessing.Process object
    process = Process(target=add_new_derived_column, args=(shared_df_obj,))
    process.start()
    process.join()

    print(shared_df_obj.value.head())
    '''
    "new_column" disappeared.
    the DataFrame object isn't shared.

          A         B
0 -0.545815 -0.179209
1  0.654273 -2.015285
2  0.865932 -0.943028
3 -0.850136  0.464778
4 -1.077967 -1.127802
    '''
Bruce Jung
  • What are you trying to do? I assume there is some reason that you want to have the dataframe available for multiprocessing – James Tobin Aug 12 '14 at 18:49
  • Is there a way to pass your dataframe through a Pipe, which wouldn't require a shared object? Another solution could be to pass your dataframe as an array with the data + 2 arrays for the labels, and reconstruct it after modifications. – CoMartel Feb 02 '16 at 14:27
  • 1
  • @CoMartel this is essentially what happens with `Queue` – the problem with that is (for large frames) the overhead associated with moving the frame through the pipe destroys any benefit from processing it in parallel (depending on what it is that you're doing exactly) – adam Aug 13 '21 at 21:07

1 Answer


You can use a `Manager` `Namespace`; the following code works as you expect.

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
from multiprocessing import Manager, Process

def add_new_derived_column(ns):
    dataframe2 = ns.df
    dataframe2['new_column'] = dataframe2['A'] + dataframe2['B'] / 2
    print(dataframe2.head())
    ns.df = dataframe2

if __name__ == "__main__":

    mgr = Manager()
    ns = mgr.Namespace()

    dataframe = pd.DataFrame(np.random.randn(100000, 2), columns=['A', 'B'])
    ns.df = dataframe
    print (dataframe.head())

    # pass the shared Namespace to the multiprocessing.Process object
    process = Process(target=add_new_derived_column, args=(ns,))
    process.start()
    process.join()

    print (ns.df.head())
Manu Valdés