One of the great features of groupby objects in Pandas is the ability to use apply to run arbitrary functions on groups. I am trying to parallelize this using multiprocessing.

So starting out with a single groupby object, I want to:

  1. split it into multiple groupby objects
  2. feed them to multiprocessing.Pool workers
  3. run groupby.apply on them
  4. concatenate the result

Here's the dream workflow in code:

import pandas as pd
from multiprocessing import Pool

# create the initial groupby
gb = df.groupby('variable')

# split into multiple groupby objects
# (gb.split doesn't exist; this is the method I'm looking for)
many_groupbys = gb.split(n_chunks=10)

# now many_groupbys is a list of 10 groupby objects

# this is our transformer; transformation is any per-group function
def func(groupby):
    return groupby.apply(transformation)

# submit to pool
with Pool(10) as pool:
    results = pool.map(func, many_groupbys)

result = pd.concat(results)
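
For reference, the closest workaround I can come up with is to chunk the distinct group keys myself, build one sub-DataFrame per chunk, and re-create the groupby inside each worker. This is only a sketch: the column name 'variable' and the transformation function are placeholders carried over from the example above, and shipping sub-DataFrames instead of groupby objects also avoids having to pickle the groupby itself.

import numpy as np
import pandas as pd
from multiprocessing import Pool

def transformation(group):
    # placeholder per-group function
    return group.sum()

def apply_on_chunk(sub_df):
    # re-create the groupby inside the worker and run apply on it
    return sub_df.groupby('variable').apply(transformation)

def split_groupby_apply(df, n_chunks=10, n_workers=10):
    # split the distinct group keys into roughly equal chunks
    keys = df['variable'].unique()
    key_chunks = np.array_split(keys, n_chunks)

    # build one sub-DataFrame per chunk so no group is split across workers
    sub_dfs = [df[df['variable'].isin(chunk)] for chunk in key_chunks]

    with Pool(n_workers) as pool:
        results = pool.map(apply_on_chunk, sub_dfs)

    return pd.concat(results)

This still feels like boilerplate I'd rather not write, which is why I'm asking whether something like gb.split exists.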

So: is there a way to split a single groupby object into multiple groupby objects? And is there a better workflow for parallelizing computations on dataframes when you can't split on arbitrary rows and you care about processing groups of rows together?

Please note that I don't want to process groups individually; I want to work with groupby objects.
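
To be concrete, here is a minimal sketch of the per-group pattern I'm trying to avoid: it submits one task per group and bypasses groupby.apply entirely.

# one task per group: this loses the groupby.apply semantics I want
with Pool(10) as pool:
    results = pool.map(transformation, [group for _, group in gb])

result = pd.concat(results)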

Andrey Portnoy
  • Possible duplicate of [Parallelize apply after pandas groupby](https://stackoverflow.com/questions/26187759/parallelize-apply-after-pandas-groupby) – Lev Zakharov Aug 15 '18 at 01:37
  • @LevZakharov Not a duplicate (although related), since I am not looking to process groups individually. – Andrey Portnoy Aug 15 '18 at 01:48
  • Then check this [answer](https://stackoverflow.com/questions/45545110/how-do-you-parallelize-apply-on-pandas-dataframes-making-use-of-all-cores-on-o). – Lev Zakharov Aug 15 '18 at 01:52
  • @LevZakharov Yeah, unfortunately Dask doesn't support multiindexed dataframes, which I really need. – Andrey Portnoy Aug 15 '18 at 01:54
