
I recently started to use multiprocessing when mapping some complex functions over a pandas DataFrame. For example, if I want to create a new column based on the value of some other column, I could do:

import seaborn as sns
iris = sns.load_dataset('iris')

import multiprocessing as mp

#example of a "complex function" returning some array
def function_1(val_):
    return [1] * round(val_)

with mp.Pool(mp.cpu_count()) as pool:
    iris['test_1'] = pool.map(function_1, iris['petal_length'])

This is much faster than using `apply` with a lambda function.
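For comparison, the single-process version can be written without a lambda at all; this is just a sketch on a tiny stand-in frame (the column name matches the iris example, the data is made up):

```python
import pandas as pd

def function_1(val_):
    return [1] * round(val_)

# toy stand-in for the iris frame
df = pd.DataFrame({'petal_length': [1.4, 4.7]})

# single-process equivalent of the pool.map call above
df['test_1'] = df['petal_length'].apply(function_1)
print(df['test_1'].tolist())  # [[1], [1, 1, 1, 1, 1]]
```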

If I have a function that takes multiple other columns of a DataFrame as input (plus some parameters), I would normally apply it like this:

def function_2(val_1, val_2, param_):
    return [param_] * round(val_1 + val_2)


iris['test_2'] = iris.apply(lambda x: function_2(x['petal_length'], x['sepal_width'], 3), axis=1)

How can I use multiprocessing for `function_2`, which takes more than one input?

matt525252
  • maybe [this](https://stackoverflow.com/questions/5442910/python-multiprocessing-pool-map-for-multiple-arguments) is helpful. However, don't think `pool` is the right approach if you can write your function using vectorize numpy/pandas functions. – Quang Hoang Nov 15 '19 at 17:24
  • Thanks @QuangHoang. Yes pool is not great when vectorization is possible. However, my real function is quite complex, working with simulations and predictions and I am not sure if I would be able to use vectorized numpy/pandas functions. – matt525252 Nov 18 '19 at 09:56

1 Answer

There may be a cleaner answer to this, but I typically would do the following:

import itertools
import multiprocessing as mp

def function_2(args):
    # unpack the (val_1, val_2, param_) tuple passed in by pool.map
    val_1, val_2, param_ = args
    return [param_] * round(val_1 + val_2)

with mp.Pool(mp.cpu_count()) as pool:
    iris['test_2'] = pool.map(
        function_2,
        zip(iris['petal_length'], iris['sepal_width'], itertools.repeat(3)),
    )

You may have to do some additional formatting of your inputs to get this right.
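Alternatively (not part of the original answer, just a sketch), `Pool.starmap` unpacks each argument tuple for you, so `function_2` can keep its original three-argument signature. The small frame below is a made-up stand-in for iris:

```python
import itertools
import multiprocessing as mp
import pandas as pd

def function_2(val_1, val_2, param_):
    return [param_] * round(val_1 + val_2)

if __name__ == '__main__':
    # toy stand-in for the iris frame
    df = pd.DataFrame({'petal_length': [1.4, 4.7], 'sepal_width': [3.5, 3.2]})
    with mp.Pool(2) as pool:
        # starmap calls function_2(val_1, val_2, param_) for each tuple
        df['test_2'] = pool.starmap(
            function_2,
            zip(df['petal_length'], df['sepal_width'], itertools.repeat(3)),
        )
    print(df['test_2'].tolist())  # [[3, 3, 3, 3, 3], [3, 3, 3, 3, 3, 3, 3, 3]]
```

The `if __name__ == '__main__':` guard matters on platforms that spawn rather than fork (e.g. Windows), where child processes re-import the module.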

Conner