
Is it possible to partition a pandas dataframe to do multiprocessing?

Specifically, my DataFrames are simply too big and take several minutes to run even one transformation on a single processor.

I know I could do this in Spark, but a lot of code has already been written, so I would prefer to stick with what I have and add parallel functionality.


1 Answer


Slightly modifying https://stackoverflow.com/a/29281494/5351271, I got a solution that works over rows.

from multiprocessing import Pool, cpu_count

import pandas

def applyParallel(dfGrouped, func):
    # Farm each group out to a worker process, then stitch the
    # per-group results back together.
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pandas.concat(ret_list)

def apply_row_foo(input_df):
    # row_foo is whatever per-row function you want to run in parallel.
    return input_df.apply(row_foo, axis=1)

chunk_size = 10

# Integer-dividing a RangeIndex by chunk_size assigns each run of
# chunk_size consecutive rows to the same group.
grouped = df.groupby(df.index // chunk_size)
applyParallel(grouped, apply_row_foo)

If the index is not merely a row number, group by np.arange(len(df)) // chunk_size instead, as sketched below.
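
For example, with a hypothetical DataFrame indexed by string labels, a position-based array still splits it into fixed-size chunks:

import numpy as np

# np.arange(len(df)) numbers the rows 0..len(df)-1 regardless of the
# actual index values, so integer division by chunk_size still yields
# groups of chunk_size consecutive rows.
grouped = df.groupby(np.arange(len(df)) // chunk_size)
applyParallel(grouped, apply_row_foo)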

Decidedly not elegant, but it worked in my use case.
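
One usage note: multiprocessing may re-import the calling module to spawn workers (it always does on Windows), so the driver code should sit under a __main__ guard, and the function handed to Pool.map must be defined at module top level so it can be pickled. A minimal end-to-end sketch, with a toy DataFrame and a stand-in row_foo:

import numpy as np
import pandas
from multiprocessing import Pool, cpu_count

def row_foo(row):
    # Toy per-row transformation; replace with your real logic.
    return row["a"] + row["b"]

def apply_row_foo(input_df):
    return input_df.apply(row_foo, axis=1)

def applyParallel(dfGrouped, func):
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pandas.concat(ret_list)

if __name__ == "__main__":
    df = pandas.DataFrame({"a": range(1000), "b": range(1000)})
    chunk_size = 100
    grouped = df.groupby(np.arange(len(df)) // chunk_size)
    print(applyParallel(grouped, apply_row_foo).head())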
