A short comment to accompany JD Long's answer. I've found that if the number of groups is very large (say, hundreds of thousands) and your apply function is doing something fairly simple and quick, then breaking your dataframe into chunks and assigning each chunk to a worker to carry out a groupby-apply in serial can be much faster than a parallel groupby-apply in which the workers read off a queue containing a multitude of groups. Example:
```python
import pandas as pd
import numpy as np
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

nrows = 15000
np.random.seed(1980)
df = pd.DataFrame({'a': np.random.permutation(np.arange(nrows))})
```
So our dataframe looks like:
```
      a
0  3425
1  1016
2  8141
3  9263
4  8018
```
Note that column 'a' has many groups (think customer ids):
```python
len(df.a.unique())
```
```
15000
```
A function to operate on our groups:
```python
def f1(group):
    # Simulate a small, fast operation on each group.
    time.sleep(0.0001)
    return group
```
Start a pool:
```python
ppe = ProcessPoolExecutor(12)
futures = []
results = []
```
Do a parallel groupby-apply:
```python
%%time

for name, group in df.groupby('a'):
    p = ppe.submit(f1, group)
    futures.append(p)

for future in as_completed(futures):
    r = future.result()
    results.append(r)

df_output = pd.concat(results)
del ppe
```
```
CPU times: user 18.8 s, sys: 2.15 s, total: 21 s
Wall time: 17.9 s
```
Let's now add a column which partitions the df into many fewer groups:
```python
df['b'] = np.random.randint(0, 12, nrows)
```
Now instead of 15000 groups there are only 12:
```python
len(df.b.unique())
```
```
12
```
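(As an aside: the partition here is random. If you'd rather deal the groups out deterministically and evenly, something like the sketch below would also work. Since 'a' is just a permutation of the integers 0 to 14999, taking it modulo 12 is a quick trick; for real customer ids you'd want to hash first. This is my own variation, not the partition used in the timings below.)

```python
# Deterministic alternative to the random partition (my own sketch):
# each group key lands in exactly one of the 12 chunks.
df['b'] = df['a'] % 12
```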
We'll partition our df and do a groupby-apply on each chunk.
```python
ppe = ProcessPoolExecutor(12)
futures = []   # reset from the first run so results aren't duplicated
results = []
```
A wrapper function:
```python
def f2(df):
    # Run the groupby-apply in serial within this chunk
    # and return the applied result.
    return df.groupby('a').apply(f1)
```
Send out each chunk to be operated on in serial:
```python
%%time

for i in df.b.unique():
    p = ppe.submit(f2, df[df.b == i])
    futures.append(p)

for future in as_completed(futures):
    r = future.result()
    results.append(r)

df_output = pd.concat(results)
```
```
CPU times: user 11.4 s, sys: 176 ms, total: 11.5 s
Wall time: 12.4 s
```
Note that the amount of time spent per group has not changed. What has changed is the length of the queue from which the workers read. I suspect that the workers cannot access the shared memory simultaneously, and because they keep returning to read off the queue, they are stepping on each other's toes. With larger chunks to operate on, the workers return less frequently, so this contention is ameliorated and the overall execution is faster.
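If you want to reuse this pattern, it can be wrapped up in a small helper. This is just a sketch of the chunking approach above, not a canonical implementation: the names chunked_groupby_apply and _apply_chunk are mine, np.array_split deals the group keys out into roughly equal chunks, and the chunk worker has to be a module-level function so that ProcessPoolExecutor can pickle it (the same goes for func).

```python
def _apply_chunk(chunk, key, func):
    # Serial groupby-apply within a single chunk of groups.
    return chunk.groupby(key).apply(func)

def chunked_groupby_apply(df, key, func, n_workers=12):
    # Deal the unique group keys out into n_workers chunks, then
    # hand each chunk to a worker for a serial groupby-apply.
    key_chunks = np.array_split(df[key].unique(), n_workers)
    with ProcessPoolExecutor(n_workers) as ppe:
        futures = [ppe.submit(_apply_chunk, df[df[key].isin(keys)], key, func)
                   for keys in key_chunks]
        return pd.concat(f.result() for f in as_completed(futures))
```

Usage then becomes a one-liner:

```python
df_output = chunked_groupby_apply(df, 'a', f1, n_workers=12)
```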