
I'm having some trouble splitting the aggregation step of a group-by operation across multiple cores. I have the following working code, and would like to apply it over several processors:

import pandas as pd
import numpy as np
from multiprocessing import Pool, cpu_count

mydf = pd.DataFrame({'v1':[1,2,3,4]*6,'v2':['a','b','c']*8,'v3':np.arange(20,44)})

To which I can then apply the following GroupBy operation (the step I wish to do in parallel):

mydf.groupby(['v1', 'v2']).apply(lambda x: np.percentile(x['v3'], [20, 30]))

yielding the series:

1   a     [22.4, 23.6]
    b     [26.4, 27.6]
    c     [30.4, 31.6]
2   a     [31.4, 32.6]
    b     [23.4, 24.6]
    c     [27.4, 28.6]

I tried the following, with reference to: parallel groupby

def applyParallel(dfGrouped, func):
    # run func on each group in a separate worker process, then stitch the results back together
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pd.concat(ret_list)

def myfunc(df):
    # attach the group's 20th and 80th percentiles of v3 to every row of the group
    df['pct1'] = np.percentile(df['v3'], 20)
    df['pct2'] = np.percentile(df['v3'], 80)
    return df


grouped = mydf.groupby(['v1', 'v2'])
applyParallel(grouped,myfunc)

But I'm losing the index structure and getting duplicates. I could probably fix this with a further group-by operation, but I think it shouldn't be too difficult to avoid that entirely. Any suggestions?
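
For reference, the sort of fix I have in mind (an untested sketch, not my original code) would be to carry the group names through the pool and put them back with keys= when concatenating, so the (v1, v2) levels survive:

def applyParallelKeyed(dfGrouped, func):
    # split the GroupBy object into its group names and group frames
    names, groups = zip(*list(dfGrouped))
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, groups)
    # keys= rebuilds the (v1, v2) index levels that a plain concat drops
    return pd.concat(ret_list, keys=names)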


1 Answer


Not that I'm still looking for an answer, but it'd probably be better to use a library that handles parallel manipulation of pandas DataFrames, rather than trying to do so manually.

Dask is one option which is intended to scale Pandas operations with little code modification.
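
For example, a minimal dask.dataframe sketch (the partition count, meta, and scheduler choice below are my own guesses, not a tuned setup):

import dask.dataframe as dd

# split mydf into partitions so the groupby-apply can run on several workers
ddf = dd.from_pandas(mydf, npartitions=4)

result = (ddf.groupby(['v1', 'v2'])
             .apply(lambda x: np.percentile(x['v3'], [20, 30]),
                    meta=('v3', 'object'))
             .compute(scheduler='processes'))

Note that dask.dataframe computes with a threaded scheduler by default, so it's the scheduler='processes' call (or a distributed Client) that actually spreads the work over cores, and for a frame this small the overhead can easily outweigh any gain.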

Another option (though maybe a little more difficult to set up) is PySpark.
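
A rough PySpark equivalent (assuming Spark 3.x, where grouped applyInPandas is available; the schema string and the pct20/pct30 column names are mine):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()
sdf = spark.createDataFrame(mydf)

def pct(pdf):
    # pdf is the pandas DataFrame for one (v1, v2) group
    p20, p30 = np.percentile(pdf['v3'], [20, 30])
    return pd.DataFrame({'v1': [pdf['v1'].iloc[0]],
                         'v2': [pdf['v2'].iloc[0]],
                         'pct20': [p20],
                         'pct30': [p30]})

result = (sdf.groupBy('v1', 'v2')
             .applyInPandas(pct, schema='v1 long, v2 string, pct20 double, pct30 double')
             .toPandas())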

    I tried to use dask for my groupby operation, but everything I tried made it slower and increased memory consumption. – gerrit Jan 30 '20 at 17:04