The task is to create a single function that can
- parallelize the pd.DataFrame.apply function by splitting the rows into portions according to the number of partitions requested by the user
- allow a user-specified function to be passed into the parallelized apply function
E.g. given a text input, I have tried:
from multiprocessing import Pool
import numpy as np
import pandas as pd
from nltk import word_tokenize

def apply_me(df_this):
    # Tokenize every row of the first (and only) column.
    return df_this[0].astype(str).apply(word_tokenize)

def parallelize_apply(df, apply_me, num_partitions):
    # One chunk of rows per worker process.
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_partitions)
    df = pd.concat(pool.map(apply_me, df_split))
    pool.close()
    pool.join()  # wait for the workers to exit
    return df
text = """Let's try something.
I have to go to sleep.
Today is June 18th and it is Muiriel's birthday!
Muiriel is 20 now.
The password is "Muiriel"."""
df = pd.DataFrame(text.split('\n'))
parallelize_apply(df, apply_me, 2)
[out]:
0 [Let, 's, try, something, .]
1 [I, have, to, go, to, sleep, .]
2 [Today, is, June, 18th, and, it, is, Muiriel, ...
3 [Muiriel, is, 20, now, .]
4 [The, password, is, ``, Muiriel, '', .]
Name: 0, dtype: object
My question is: is there a way for me to pass the word_tokenize() function into parallelize_apply() instead of hard-coding it in apply_me()?
Simplifying the functions, I've tried:
from functools import partial
from multiprocessing import Pool
import numpy as np
import pandas as pd
from nltk import word_tokenize

def apply_me(df_this, func):
    return df_this[0].astype(str).apply(func)

def parallelize_apply(df, func, num_partitions):
    df_split = np.array_split(df, num_partitions)
    with Pool(num_partitions) as pool:
        # Bind func so that pool.map only has to supply each chunk.
        _apply_me = partial(apply_me, func=func)
        # pool.map blocks until every chunk is done, so no join is needed.
        df = pd.concat(pool.map(_apply_me, df_split))
    return df.tolist()
text = """Let's try something.
I have to go to sleep.
Today is June 18th and it is Muiriel's birthday!
Muiriel is 20 now.
The password is "Muiriel"."""
df = pd.DataFrame(text.split('\n'))
parallelize_apply(df, word_tokenize, 6)
This gives the same tokens (as a list, because of tolist()), but an intermediate function, apply_me, is still needed for parallelize_apply to work as desired. Is there a way to remove it?
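One possible direction, sketched here under some assumptions rather than as a confirmed answer: since apply_me only forwards to Series.apply, it may be enough to bind func to pd.Series.apply directly with functools.partial and map that over the chunks. In this sketch the function takes a Series instead of a DataFrame, the chunks are cut with iloc instead of np.array_split, and pool_cls and tokenize are hypothetical names added for illustration (tokenize is a stand-in for word_tokenize so the snippet runs without nltk). It assumes a non-empty input.

```python
from functools import partial
from multiprocessing import Pool

import pandas as pd


def tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: plain whitespace split.
    return text.split()


def parallelize_apply(series, func, num_partitions, pool_cls=Pool):
    """Apply func to each element of series, one chunk of rows per worker.

    pool_cls is an assumption of this sketch: any class with the same
    interface as multiprocessing.Pool (e.g. ThreadPool) can be passed in.
    """
    # Cut the rows into num_partitions contiguous, near-equal chunks.
    n = len(series)
    bounds = [n * i // num_partitions for i in range(num_partitions + 1)]
    chunks = [series.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if a < b]

    # pd.Series.apply is a module-level attribute, so the partial can be
    # pickled and sent to the workers; no intermediate apply_me is defined.
    apply_func = partial(pd.Series.apply, func=func)

    with pool_cls(num_partitions) as pool:
        # pool.map preserves chunk order, so concat restores the row order.
        return pd.concat(pool.map(apply_func, chunks))
```

With the data above this would be called as parallelize_apply(df[0].astype(str), word_tokenize, 2), ideally under an if __name__ == "__main__": guard so that it also works on platforms that spawn rather than fork the worker processes. The function passed in has to be picklable, so a lambda would not work with the default Pool.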