Modifying a dataframe inside of multiprocessing

Question

I am using multiprocessing to run some fairly time-consuming tasks concurrently. Each task builds a separate dataframe, and I plan to merge them into one later. The setup looks like this:

from multiprocessing import Manager, Process
from pprint import pprint

manager = Manager()
ns = manager.Namespace()

# data_format.init() just creates a new dataframe with predefined columns
ns.df_one = data_format.init()
ns.df_two = data_format.init()

p_search_one = Process(target=search_function_one, args=(ns,))
p_search_one.start()

p_search_two = Process(target=search_function_one, args=(ns,))
p_search_two.start()

p_search_one.join()
p_search_two.join()

pprint(f'search_one: {ns.df_one }')  # prints an empty dataframe

Then, in the function I am modifying the dataframe:

def search_function_one(ns):
    df = ns.df_one
    do_some_magic(df) # just adds rows to the dataframe, not returning anything just modifies in place
    pprint(f'df from ns: {ns.df_one}')  # prints an empty dataframe
    pprint(f'df from ns: {df}')  # prints an empty dataframe

I have also tried passing ns.df_one directly, instead of first assigning it to a local df (which I suspected might be a copy), like so:

def search_function_one(ns):
    do_some_magic(ns.df_one) # just adds rows to the dataframe, not returning anything just modifies in place
    pprint(f'df from ns: {ns.df_one}')  # prints an empty dataframe

But that also prints an empty dataframe. Without concurrency this works as expected and the dataframe is modified in place, but with multiprocessing the changes are lost.

I'm also wondering whether do_some_magic is the issue, since it's in another file. It doesn't rebind the input parameter, though (there is no df = input_var); it operates on the input variable directly.

Am I doing something fundamentally wrong in how I'm managing my datatypes?

Doing `ns.df_one` does not pass some handle to a shared memory where the dataframe is stored, it passes the entire dataframe by value. Not only does this make your code inefficient, it also means that any changes that are done after getting the dataframe won't be automatically reflected in the namespace unless you do `ns.df_one = modified_df` explicitly. Check this [answer](https://stackoverflow.com/a/72817277/16310741), it explains the problem better and gives an alternate solution — Charchit Agarwal, Sep 21 '22 at 12:22
Thank you, I've read through that and now it makes a lot more sense! — , Sep 21 '22 at 12:47
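
To illustrate the comment above, here is a minimal sketch of the explicit reassignment it describes. It assumes a plain list of row dicts as a stand-in for the dataframe so that it runs without pandas; a DataFrame stored on a `Manager().Namespace()` should behave the same way, because reading an attribute hands the process a fresh unpickled copy. The names `broken_worker`, `fixed_worker`, `run` and `rows_one` are hypothetical and not from the original code.

```python
from multiprocessing import Manager, Process


def broken_worker(ns):
    rows = ns.rows_one        # reading the attribute returns a copy
    rows.append({"id": 1})    # mutates only that local copy


def fixed_worker(ns):
    rows = ns.rows_one        # still a copy
    rows.append({"id": 1})
    ns.rows_one = rows        # write the modified object back to the manager


def run(worker):
    # Start a manager, run one worker process, and return what the
    # namespace holds once the worker has finished.
    with Manager() as manager:
        ns = manager.Namespace()
        ns.rows_one = []
        p = Process(target=worker, args=(ns,))
        p.start()
        p.join()
        return ns.rows_one


if __name__ == "__main__":
    print(run(broken_worker))  # []
    print(run(fixed_worker))   # [{'id': 1}]
```

In the question's code this would amount to ending `search_function_one` with `ns.df_one = df`. Note that every read and write of `ns.df_one` pickles the whole object, so for large dataframes it may be cheaper to have each worker return its result and merge them in the parent.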
