
The task is to create a single function to

  • parallelize the pd.DataFrame.apply function by splitting the rows into chunks according to the number of partitions requested by the user
  • allow a user-specified function to be passed into the parallelized apply function

E.g. given a text input, I have tried:

from multiprocessing import Pool

import numpy as np
import pandas as pd

from nltk import word_tokenize

def apply_me(df_this):
    # Hard-codes word_tokenize as the function applied to each chunk.
    return df_this[0].astype(str).apply(word_tokenize)

def parallelize_apply(df, apply_me, num_partitions):
    # Split the rows into num_partitions chunks and process each in the pool.
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_partitions)
    df = pd.concat(pool.map(apply_me, df_split))
    pool.close()
    pool.join()
    return df

text = """Let's try something.
I have to go to sleep.
Today is June 18th and it is Muiriel's birthday!
Muiriel is 20 now.
The password is "Muiriel"."""

df = pd.DataFrame(text.split('\n'))

parallelize_apply(df, apply_me, 2)

[out]:

0                         [Let, 's, try, something, .]
1                      [I, have, to, go, to, sleep, .]
2    [Today, is, June, 18th, and, it, is, Muiriel, ...
3                            [Muiriel, is, 20, now, .]
4              [The, password, is, ``, Muiriel, '', .]
Name: 0, dtype: object

My question is: is there a way for me to pass the word_tokenize() function into parallelize_apply() directly, rather than hard-coding it into the apply_me() function?


Simplifying the functions, I've tried:

from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd

from nltk import word_tokenize

def apply_me(df_this, func):
    # Applies the user-specified function to the first column of a chunk.
    return df_this[0].astype(str).apply(func)

def parallelize_apply(df, func, num_partitions):
    df_split = np.array_split(df, num_partitions)
    with Pool(num_partitions) as pool:
        # Bind func so pool.map only needs to pass in the chunk.
        _apply_me = partial(apply_me, func=func)
        df = pd.concat(pool.map(_apply_me, df_split))
    # The context manager terminates the pool on exit, so no explicit
    # close()/join() is needed here.
    return df.tolist()

text = """Let's try something.
I have to go to sleep.
Today is June 18th and it is Muiriel's birthday!
Muiriel is 20 now.
The password is "Muiriel"."""

df = pd.DataFrame(text.split('\n'))

parallelize_apply(df, word_tokenize, 6)

It achieves the same output, but there is still a need for a middle function, apply_me, for parallelize_apply to work as desired. Is there a way to remove it?
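
For what it's worth, a minimal sketch of one way the middle function could be dropped, assuming the target column is column 0 and that func is a picklable, module-level function: map func over the rows themselves instead of over DataFrame chunks, so pool.map does the per-row dispatch and no apply_me-style helper is needed.

from multiprocessing import Pool

import pandas as pd

from nltk import word_tokenize

def parallelize_apply(df, func, num_partitions):
    # Map func over individual rows rather than DataFrame chunks;
    # chunksize tells pool.map how many rows to hand each worker.
    rows = df[0].astype(str)
    chunksize = max(1, len(rows) // num_partitions)
    with Pool(num_partitions) as pool:
        return pool.map(func, rows, chunksize=chunksize)

Calling parallelize_apply(df, word_tokenize, 2) on the same df should return the same list of token lists as the .tolist() version above.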

alvas
  • What do you aim to achieve by eliminating `apply_me`? You're already passing `word_tokenize` directly to `parallelize_apply`. – Ami Tavory Apr 24 '18 at 06:40
  • I'm not sure why there is a need for `apply_me` to exist. When I replaced it with a lambda function inside `parallelize_apply`, multiprocessing isn't very happy about pooling lambda functions. – alvas Apr 24 '18 at 06:43
  • Oh, that seems like `pickle/cPickle` crap. You should use `dill` or, better, `dill` with `joblib` [see here](https://stackoverflow.com/questions/25348532/can-python-pickle-lambda-functions). – Ami Tavory Apr 24 '18 at 06:44
  • Is it passing serialized outputs across the Pool?! That doesn't seem very secure (though it's not my concern for this project). Thanks for the suggestion! Let me try `joblib` =) – alvas Apr 24 '18 at 06:47
  • Yeah, this is Python: there's no real parallelism, just multiple interpreters in different address spaces. To cross them, you have to pass things through, somehow. – Ami Tavory Apr 24 '18 at 06:49
  • `apply()` is dead slow. Parallelizing it will only get you a 2-10x speedup after you imposed a 1000x slowdown on yourself. Don't do this at all. For your given example, use `df.str.split()` instead--it is a vectorized operation that does more or less the same thing but lightning quick. If you really can't do that, just use something like `concurrent.futures.ProcessPoolExecutor.map(word_tokenize, text)` before creating a Pandas structure. – John Zwinck Apr 24 '18 at 07:21
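
For reference, a minimal sketch of the `concurrent.futures` approach suggested in the last comment, assuming nltk's punkt tokenizer data is available: tokenize the raw lines in worker processes first, then build the pandas structure from the results.

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

from nltk import word_tokenize

# Reusing the `text` defined above: tokenize the lines in worker
# processes, then construct the DataFrame from the token lists.
lines = text.split('\n')
with ProcessPoolExecutor(max_workers=2) as executor:
    tokens = list(executor.map(word_tokenize, lines))

df = pd.DataFrame({'tokens': tokens})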

0 Answers