I have a function that processes a relatively large dataframe, and it takes quite a while to run. While looking for ways to improve the run time, I came across multiprocessing Pool. If I understood correctly, it should run the function on equal chunks of the dataframe in parallel, which could make it run faster.
My function takes four arguments: the first is the dataframe of interest, and the last three are mainly lookups. It looks something like this:
def functionExample(dataOfInterest, lookup1, lookup2, lookup3):
    # do stuff with the data and lookups
    return output1, output2
So based on what I've read, I came up with the following, which I thought should work:
import numpy as np
import pandas as pd
from multiprocessing import Pool

num_partitions = 4
num_cores = 4

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
Then to call the process (this is mainly where I got stuck), I tried the following:
output1, output2 = parallelize_dataframe(dataOfInterest, functionExample)
This returns the error:
functionExample() missing 3 required positional arguments: 'lookup1', 'lookup2', and 'lookup3'
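As far as I can tell, this happens because `pool.map` calls the function with one argument only (the chunk). A minimal sketch that reproduces the same kind of message without multiprocessing, using the built-in `map`, which calls the function the same way. The function body and the list chunks are hypothetical stand-ins, not my real code:

```python
def functionExample(dataOfInterest, lookup1, lookup2, lookup3):
    # Hypothetical stand-in body; the real function does more work.
    return dataOfInterest, lookup1


# Plain lists standing in for the dataframe chunks (assumption for illustration).
chunks = [[1, 2], [3, 4]]

try:
    # map(), like pool.map(), calls func(chunk) with a single argument.
    list(map(functionExample, chunks))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

So the three lookups are never supplied, because nothing in the call passes them along.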
Then I tried adding the three arguments:
output1, output2= parallelize_dataframe(dataOfInterest, functionExample(lookup1, lookup2, lookup3))
This returns the error below, which suggests that the three arguments were taken as the first three parameters of the function, so the fourth is now reported missing, rather than being treated as the last three that the previous error said were missing:
functionExample() missing 1 required positional argument: 'lookup3'
And then if I try passing all four arguments:
output1, output2= parallelize_dataframe(dataOfInterest, functionExample(dataOfInterest, lookup1, lookup2, lookup3))
It returns the error below:
'tuple' object is not callable
I'm not sure which of the above, if any, is the right way to do it. Should it be given all of the function's arguments, including the dataframe itself? If so, why is it complaining about tuples?
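My best guess at the tuple part, as a sketch rather than a confirmed diagnosis: writing `functionExample(dataOfInterest, lookup1, lookup2, lookup3)` runs the function immediately, so what gets passed on is its return value, the tuple `(output1, output2)`, and not the function itself. The stand-in body, the numeric arguments, and the helper `apply_to_chunks` below are hypothetical and only mimic how `pool.map` calls its first argument:

```python
def functionExample(dataOfInterest, lookup1, lookup2, lookup3):
    # Hypothetical stand-in body: returns two outputs, like the real function.
    return dataOfInterest + lookup1, dataOfInterest * lookup2 * lookup3


def apply_to_chunks(chunks, func):
    # Hypothetical helper with the same calling pattern as pool.map:
    # func is called once per chunk.
    return [func(chunk) for chunk in chunks]


# This call runs right away and evaluates to a tuple (output1, output2).
already_called = functionExample(1, 10, 2, 3)

try:
    # Passing the tuple where a function is expected fails on the first call.
    apply_to_chunks([1, 2], already_called)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```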
Any help would be appreciated! Thanks.
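One direction that might work, sketched under assumptions and not tested on my real data: bind the three lookups ahead of time with `functools.partial`, so that `pool.map` only has to supply the chunk. Because the function returns two outputs, the list of tuples coming back from `pool.map` would also need to be regrouped before concatenating. The function body, the `value` column, and the numeric lookups are placeholders invented for illustration, and the chunks are built with `iloc` on split row positions instead of calling `np.array_split` on the dataframe directly:

```python
from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd


def functionExample(dataOfInterest, lookup1, lookup2, lookup3):
    # Placeholder body (hypothetical): the real function does its own work here.
    output1 = dataOfInterest.assign(total=dataOfInterest["value"] + lookup1)
    output2 = dataOfInterest.assign(scaled=dataOfInterest["value"] * lookup2 * lookup3)
    return output1, output2


def parallelize_dataframe(df, func, num_partitions=4, num_cores=4):
    # Split by row position so each worker receives one DataFrame chunk.
    positions = np.array_split(np.arange(len(df)), num_partitions)
    chunks = [df.iloc[idx] for idx in positions]
    with Pool(num_cores) as pool:
        results = pool.map(func, chunks)  # list of (output1, output2) tuples
    # Regroup: all first outputs together, all second outputs together.
    first_parts, second_parts = zip(*results)
    return pd.concat(first_parts), pd.concat(second_parts)


if __name__ == "__main__":
    df = pd.DataFrame({"value": range(10)})
    # partial fixes the three lookups, leaving only the chunk argument open.
    bound = partial(functionExample, lookup1=1, lookup2=2, lookup3=3)
    output1, output2 = parallelize_dataframe(df, bound)
    print(output1.shape, output2.shape)
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the script. Large lookups would be pickled and sent to the workers along with each chunk, which may eat into any speedup.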