I need to do some computation on different slices of some big dataframes. Suppose I have three big dataframes, df1, df2 and df3, each of which has a "Date" column. The computation is based on date slices, and since each iteration is independent of the others, I want to run the iterations concurrently.
df1 # a big dataframe
df2 # a big dataframe
df3 # a big dataframe
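For concreteness, a toy version of my setup would look something like this (the value columns a, b and c are made up; only the "Date" column matters for the slicing):

import pandas as pd

# toy stand-ins for the real (much bigger) dataframes
dates = pd.date_range('2020-04-01', periods=30, freq='D')
df1 = pd.DataFrame({'Date': dates, 'a': range(30)})
df2 = pd.DataFrame({'Date': dates, 'b': range(30)})
df3 = pd.DataFrame({'Date': dates, 'c': range(30)})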
So I define my desired function; in each child process, a slice of df1, df2 and df3 is created first, and then the other computations go on. Since df1, df2 and df3 are global dataframes, I have to bind them as default arguments of my function; otherwise they are not recognized inside the child processes. Something like below:
slices = ['2020-04-11', '2020-04-12', '2020-04-13']  # a list of dates to slice on

def my_func(date_slice, df1=df1, df2=df2, df3=df3):
    # build the per-date slices inside the child process
    sliced_df1 = df1[df1.Date > date_slice]
    sliced_df2 = df2[df2.Date < date_slice]
    sliced_df3 = df3[df3.Date >= date_slice]
    #
    # other computations
    # ...
    #
    return desired_df  # the result of the per-slice computation
The concurrent processing is configured as below:

import multiprocess
import pandas as pd
import psutil

# one worker per physical core
pool = multiprocess.Pool(psutil.cpu_count(logical=False))
results = pool.map(my_func, slices)
pool.close()
pool.join()
final_df = pd.concat(results, ignore_index=True)
However, it seems that only one core goes up during execution. I suppose that since each child process wants to access the global dataframes df1, df2 and df3, there should be some shared memory for the child processes. From searching the net, I guess I have to use multiprocessing.Manager(), but I am not sure how to use it, or whether I am right about using it at all.
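To show what I have in mind, below is a rough sketch of how I imagine the Manager would be used (the name my_func_via_manager is mine, and the pattern is only a guess on my part):

import multiprocessing as mp

# sketch: park the dataframes in a manager process and hand the
# proxy namespace to the workers instead of relying on the globals
manager = mp.Manager()
ns = manager.Namespace()
ns.df1 = df1  # each assignment pickles a copy into the manager process
ns.df2 = df2
ns.df3 = df3

def my_func_via_manager(date_slice, ns=ns):
    # every ns.dfX access fetches a full copy back from the manager,
    # which is why I doubt this is the right tool for big dataframes
    sliced_df1 = ns.df1[ns.df1.Date > date_slice]
    sliced_df2 = ns.df2[ns.df2.Date < date_slice]
    sliced_df3 = ns.df3[ns.df3.Date >= date_slice]
    # ... same computations as in my_func ...
    return sliced_df1  # placeholder return for the sketch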
I am actually new to the concept of concurrent processing, and I would appreciate it if someone could help.
PS: It seems that my question is similar to this post. However, it doesn't have an accepted answer.