I want to apply a function of the following form (the real function has 5 parameters, but let's say it has only 2):

def func(text,model):
   return model[text]

to a dataframe in the following way:

model = something
df[col2] = df[col1].apply(lambda text: func(text, model))

This works fine, but it is slow. The following multiprocessing version is faster, and it works fine unless the function is a lambda function:

from multiprocessing import Pool, cpu_count
import tqdm

def apply(func, data):
    with Pool(cpu_count()) as pool:
        return list(tqdm.tqdm(pool.imap(func, data), total=len(data)))

It throws the following error:

PicklingError: Can't pickle <function <lambda> at 0x7fe59c869e50>: attribute lookup <lambda> on __main__ failed

My solution: to apply this function faster, I used the following trick: redefine the function so that the second parameter has a default value, with model assigned before the function is defined.

model = something
def func(text,model=model):
  return model[text]

This works fine; however, I feel like it is kind of ugly. I would like to know if there are other methods to accomplish this. I also tried creating a class:

class Applyer:

    def __init__(self, model):
        self.model = model

    def func(self, text):
        return self.model[text]

If I create an instance and then apply the function like this:

model=something
applyer = Applyer(model)
apply(applyer.func,df[col1])

this works, but it is even slower than the normal apply (without multiprocessing). Those are my two attempts.

Román

1 Answer

You can partially evaluate your function with the fixed parameters and then call it with the missing variable parameter using functools.partial:

from functools import partial

partial_func = partial(func, model=some_model)

# now you can call it directly, providing the missing parameter(s):
partial_func(some_text)

# and you can apply it without a lambda:
df[col1].apply(partial_func)

This should already speed up the runtime. I haven't tried to parallelize this, but since a partial of a module-level function is an ordinary picklable callable (unlike a lambda), the approaches given in this question should work too.
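As a rough, untested-on-your-data sketch of how the partial could be combined with a pool: the helper name parallel_apply, the chunksize value, and the dict and list standing in for the real model and for df[col1] are all placeholders of mine, not part of the question. tqdm is left out to keep the example dependency-free.

```python
from functools import partial
from multiprocessing import Pool, cpu_count


def func(text, model):
    # Same shape as the function in the question: look the text up in the model.
    return model[text]


def parallel_apply(function, data, chunksize=1):
    # Hypothetical helper modeled on the question's apply(), minus tqdm.
    # `function` must be picklable: a lambda is not, but a partial of a
    # module-level function is.
    with Pool(cpu_count()) as pool:
        return list(pool.imap(function, data, chunksize=chunksize))


# Placeholder model and data; a plain dict and list stand in for the
# real model and for df[col1].
some_model = {"a": 1, "b": 2, "c": 3}
texts = ["a", "b", "c", "a"]

partial_func = partial(func, model=some_model)

if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing
    # this module.
    results = parallel_apply(partial_func, texts, chunksize=2)
    print(results)
```

The fixed arguments travel with the partial every time a task is sent to a worker, so with a large model a bigger chunksize may cut down on that overhead. That may also be why the class-based attempt was slow, but I have not measured it.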

Jan Wilamowski