I have a function that processes a relatively large dataframe, and it takes quite a while to run. While looking for ways to reduce the run time, I came across multiprocessing's Pool. If I understand correctly, it runs the function on equal chunks of the dataframe in parallel, which could make it finish sooner.

My function takes four arguments: the first is the dataframe of interest, and the last three are mainly lookups. It looks something like this:

def functionExample(dataOfInterest, lookup1, lookup2, lookup3):
    # do stuff with the data and lookups
    return output1, output2

Based on what I've read, I came up with the following, which I thought should work:

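import numpy as np
import pandas as pd
from multiprocessing import Pool

num_partitions = 4
num_cores = 4

def parallelize_dataframe(df, func):
    # Split the dataframe into chunks and apply func to each chunk in a worker process
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df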

Then, to call it (this is the part I couldn't figure out), I tried the following:

output1, output2 = parallelize_dataframe(dataOfInterest, functionExample)

This returns the error:

functionExample() missing 3 required positional arguments: 'lookup1', 'lookup2', and 'lookup3'

Then I try adding the three arguments by doing the below:

output1, output2= parallelize_dataframe(dataOfInterest, functionExample(lookup1, lookup2, lookup3))

This returns the error below, which suggests that the three arguments were bound to the function's first three parameters, leaving the fourth one missing, rather than to the last three parameters that the previous error reported as missing:

functionExample() missing 1 required positional argument: 'lookup3'

And if I try passing all four arguments, as below:

output1, output2= parallelize_dataframe(dataOfInterest, functionExample(dataOfInterest, lookup1, lookup2, lookup3))

It returns the error below:

'tuple' object is not callable

I'm not sure which of the above, if any, is the right way to do it. Should the call take all of the function's arguments, including the dataframe itself? If so, why is it complaining about tuples?

Any help would be appreciated! Thanks.

user51
Mit
  • Can you share a bit more information on this? What do the DataFrame and function look like? – AMC Jan 17 '20 at 20:23

3 Answers

You can perform a partial binding of some arguments to create a new callable via functools.partial:

from functools import partial

output1, output2 = parallelize_dataframe(dataOfInterest,
                                         partial(functionExample, lookup1=lookup1, lookup2=lookup2, lookup3=lookup3))

Note that in the multiprocessing world, partial can be slow, because the bound arguments are pickled and sent to the worker processes together with the function. If the lookups are large or expensive to pickle, you may want to find a way to avoid passing them, assuming that's possible in your use case.
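One way to avoid sending the lookups with every task is to hand them to each worker once, through the `initializer` and `initargs` parameters of `Pool`. The following is a minimal sketch under stated assumptions, not the asker's actual code: `init_worker`, `process_chunk`, `run_parallel`, and the `_lookups` dictionary are hypothetical names, and plain lists and dicts stand in for the dataframe and lookups.

```python
from multiprocessing import Pool

# Hypothetical per-process storage for the lookups (not from the original post).
_lookups = {}

def init_worker(lookup1, lookup2, lookup3):
    """Runs once in each worker process and stores the lookups there."""
    _lookups["lookup1"] = lookup1
    _lookups["lookup2"] = lookup2
    _lookups["lookup3"] = lookup3

def process_chunk(chunk):
    """Stand-in for functionExample: only the data chunk is sent per task."""
    lookup1 = _lookups["lookup1"]
    # Map each value through lookup1, using 0 for unknown keys.
    return [lookup1.get(value, 0) for value in chunk]

def run_parallel(chunks, lookup1, lookup2, lookup3, num_cores=2):
    """Create a pool whose workers receive the lookups once, at start-up."""
    with Pool(num_cores, initializer=init_worker,
              initargs=(lookup1, lookup2, lookup3)) as pool:
        return pool.map(process_chunk, chunks)
```

In a script, `run_parallel(...)` would be called under an `if __name__ == "__main__":` guard. Whether this is actually faster than `partial` depends on how large the lookups are and how many tasks are submitted, so it is worth timing both.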

ShadowRanger
  • Thanks a lot. This solves it (it's running now, will test if it goes through and proves to be quicker), as to the partial, I think I'll go for the suggestion by @alec_djinn below which should do for a way around the partial. Cheers! – Mit Jan 17 '20 at 16:37

In each case, you are calling the function yourself rather than supplying the arguments to be used when the pool calls it. What you need is a new callable that calls your original function with the correct arguments.

from functools import partial


output1, output2 = parallelize_dataframe(
    dataOfInterest,
    partial(functionExample, lookup1=x, lookup2=y, lookup3=z)
)
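To make the difference concrete, here is a small sketch that uses the built-in `map` in place of `pool.map`, so that it runs without worker processes. `function_example` is a hypothetical stand-in for `functionExample`, and the lookup values are made up for illustration.

```python
from functools import partial

def function_example(data, lookup1, lookup2, lookup3):
    """Hypothetical stand-in for functionExample; returns two outputs."""
    return sum(data) + lookup1, len(data) + lookup2 + lookup3

chunks = [[1, 2], [3, 4]]

# Calling the function first hands map() its return value (a tuple),
# not a function, which is where the 'tuple' error comes from.
result_of_call = function_example(chunks[0], 10, 20, 30)
try:
    list(map(result_of_call, chunks))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# partial() builds a callable that still waits for the first argument.
bound = partial(function_example, lookup1=10, lookup2=20, lookup3=30)
results = list(map(bound, chunks))

print(error_message)  # 'tuple' object is not callable
print(results)        # [(13, 52), (17, 52)]
```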
chepner
  • I was about to say that won't work; they're using `multiprocessing`, but you replaced the `lambda` (unpicklable) with `partial` (picklable, though potentially slow) between when I began this comment and now. :-) – ShadowRanger Jan 17 '20 at 15:46
  • Cool, I forgot pickling would be an issue. The old version was mainly to avoid assuming the OP knew the parameter names, which would preclude the use of keyword arguments. – chepner Jan 17 '20 at 15:50

You could simply modify your function definition so that the lookups have default values, or write a wrapper function that calls your original function with those parameters.

def functionExample(dataOfInterest, lookup1=x, lookup2=y, lookup3=z):
    # do stuff with the data and lookups
    return output1, output2

or

def f(dataOfInterest):
    return functionExample(dataOfInterest, lookup1=x, lookup2=y, lookup3=z)

In this way, map() would work as you expect.
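One thing to watch for once `map()` is working: because the function returns two outputs, `map()` (and likewise `pool.map()`) produces one `(output1, output2)` tuple per chunk, which `pd.concat` would likely not accept directly. The sketch below shows one way to regroup the results, assuming each output can simply be joined chunk by chunk; `function_example` and `combine` are hypothetical names, and plain lists stand in for the dataframes.

```python
from functools import partial

def function_example(data, lookup1, lookup2, lookup3):
    """Hypothetical stand-in that returns two outputs per chunk."""
    scaled = [x * lookup1 for x in data]
    shifted = [x + lookup2 + lookup3 for x in data]
    return scaled, shifted

def combine(results):
    """Regroup [(a1, b1), (a2, b2), ...] into two flat lists."""
    firsts, seconds = zip(*results)
    out1 = [x for part in firsts for x in part]
    out2 = [x for part in seconds for x in part]
    return out1, out2

chunks = [[1, 2], [3]]
bound = partial(function_example, lookup1=2, lookup2=10, lookup3=100)
per_chunk = list(map(bound, chunks))  # same shape as pool.map(bound, chunks)
output1, output2 = combine(per_chunk)

print(output1)  # [2, 4, 6]
print(output2)  # [111, 112, 113]
```

With dataframes, the two flattening steps would presumably become `pd.concat(firsts)` and `pd.concat(seconds)`, provided each output is a DataFrame or Series.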

alec_djinn