I am using multiprocessing to run some fairly time-consuming tasks concurrently. Each task builds a separate dataframe, which I will later merge into one, like so:
from multiprocessing import Manager, Process
from pprint import pprint

import data_format  # my module; init() returns a new dataframe with predefined columns

manager = Manager()
ns = manager.Namespace()
ns.df_one = data_format.init() # this just creates a new dataframe with predefined columns
ns.df_two = data_format.init() # this just creates a new dataframe with predefined columns
p_search_one = Process(target=search_function_one, args=(ns,))
p_search_one.start()
p_search_two = Process(target=search_function_two, args=(ns,))
p_search_two.start()
p_search_one.join()
p_search_two.join()
pprint(f'search_one: {ns.df_one}')  # prints an empty dataframe
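For completeness, data_format.init() does nothing more exotic than this (the column names here are placeholders):

import pandas as pd

def init():
    # returns an empty dataframe with the predefined columns
    return pd.DataFrame(columns=['name', 'score'])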
Then, in the function, I modify the dataframe:
def search_function_one(ns):
    df = ns.df_one
    do_some_magic(df)  # just adds rows to the dataframe in place; doesn't return anything
    pprint(f'df from ns: {ns.df_one}')  # prints an empty dataframe
    pprint(f'local df: {df}')  # also prints an empty dataframe
I have also tried skipping the local variable, in case df was somehow a copy of ns.df_one, and operating on ns.df_one directly:
def search_function_one(ns):
    do_some_magic(ns.df_one)  # just adds rows to the dataframe in place; doesn't return anything
    pprint(f'df from ns: {ns.df_one}')  # prints an empty dataframe
But that also just prints an empty dataframe. Without concurrency this works as expected, modifying the dataframe in place; with concurrency it doesn't.
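For reference, the non-concurrent version is essentially this, and it prints the populated dataframe:

df_one = data_format.init()
do_some_magic(df_one)  # same call; do_some_magic lives in my other module
pprint(f'search_one: {df_one}')  # prints the rows that were added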
I'm also wondering whether do_some_magic is the issue, since it lives in another file, but it doesn't rebind its input parameter (it never does df = input_var); it just operates directly on the argument it is given.
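In case it matters, do_some_magic is shaped roughly like this (the row values here are placeholders; the real rows come from the search results):

def do_some_magic(df):
    # appends rows to the passed-in dataframe; never reassigns df
    for i in range(3):
        df.loc[len(df)] = [f'result_{i}', i]  # placeholder values for the predefined columns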
Am I doing something fundamentally wrong in how I'm managing my data types?