One of the great features of groupby objects in Pandas is the ability to use apply to run arbitrary functions on groups. I am trying to parallelize this using multiprocessing.
So, starting out with a single groupby object, I want to:
- split it into multiple groupby objects
- feed them to multiprocessing.Pool workers
- run groupby.apply on them
- concatenate the results
Here's the dream workflow in code:
from multiprocessing import Pool
import pandas as pd

# create the initial groupby
gb = df.groupby('variable')

# split into multiple groupby's (hypothetical API; this method doesn't exist)
many_groupbys = gb.split(n_chunks=10)
# now many_groupbys is a list of 10 groupby objects

# this is our transformer
def func(groupby):
    return groupby.apply(transformation)

# submit to pool
with Pool(10) as pool:
    results = pool.map(func, many_groupbys)
result = pd.concat(results)
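For context, the closest approximation I've come up with is to split the unique group keys into chunks myself and rebuild a groupby inside each worker. A rough sketch (assuming the same df, 'variable' column, and transformation as above; note this regroups per chunk rather than splitting the existing groupby object, which is exactly what I'd like to avoid):

import numpy as np
import pandas as pd
from multiprocessing import Pool

def transformation(group):
    # stand-in for the real per-group logic
    return group

def apply_chunk(chunk):
    # each worker rebuilds a groupby on its slice and applies the function
    return chunk.groupby('variable').apply(transformation)

# split the unique keys into 10 roughly equal chunks, so every
# group lands intact in exactly one chunk
key_chunks = np.array_split(df['variable'].unique(), 10)
frames = [df[df['variable'].isin(keys)] for keys in key_chunks]

if __name__ == '__main__':  # guard needed on platforms that spawn workers
    with Pool(10) as pool:
        results = pool.map(apply_chunk, frames)
    result = pd.concat(results)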
So, is there a way to split a single groupby object into multiple groupby objects? And is there a better workflow for parallelizing computations on dataframes when you can't split arbitrarily on rows and the processing has to happen on groups of rows?
Please note: I don't want to process groups individually; I want to work with groupby objects.
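To make the distinction concrete, this is the per-group pattern I'm trying to avoid; it ships one group at a time to the pool instead of whole groupby objects:

# the pattern I do NOT want: mapping over individual groups
with Pool(10) as pool:
    results = pool.map(transformation, [group for _, group in gb])
result = pd.concat(results)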