I'm having trouble splitting the aggregation step of a group-by operation across multiple cores. I have the following working code and would like to run the aggregation step across several processors:
import pandas as pd
import numpy as np
from multiprocessing import Pool, cpu_count
mydf = pd.DataFrame({'v1': [1, 2, 3, 4] * 6, 'v2': ['a', 'b', 'c'] * 8, 'v3': np.arange(20, 44)})
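(This builds a 24-row frame: v1 cycles through 1, 2, 3, 4; v2 through a, b, c; and v3 runs from 20 to 43, so each (v1, v2) pair occurs exactly twice.)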
I can then apply the following groupby operation (this is the step I wish to run in parallel):
mydf.groupby(['v1', 'v2']).apply(lambda x: np.percentile(x['v3'], [20, 30]))
yielding the series:
1  a    [22.4, 23.6]
   b    [26.4, 27.6]
   c    [30.4, 31.6]
2  a    [31.4, 32.6]
   b    [23.4, 24.6]
   c    [27.4, 28.6]
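(As a check: group (1, a) holds the v3 values 20 and 32, so its 20th and 30th percentiles are 20 + 0.2*12 = 22.4 and 20 + 0.3*12 = 23.6, matching the first row above.)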
I tried the following, with reference to: parallel groupby
def applyParallel(dfGrouped, func):
    # fan the groups out to a pool of workers, then stitch the pieces back together
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pd.concat(ret_list)
def myfunc(df):
    # intended: attach the group's 20th and 80th percentiles of v3 as new columns
    df['pct1'] = df.loc[:, ['v3']].apply(np.percentile, args=([20],))
    df['pct2'] = df.loc[:, ['v3']].apply(np.percentile, args=([80],))
    return df
grouped = mydf.groupby(['v1', 'v2'])
applyParallel(grouped, myfunc)
But I'm losing the index structure (the v1/v2 group keys) and getting duplicate rows. I could probably clean this up with a further group-by at the end, but I suspect the extra step can be avoided entirely. Any suggestions?
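For context, the direction I have in mind is to carry the group keys alongside the groups and rebuild the MultiIndex once the pool returns. This is a rough, untested sketch (pct_func and applyParallelKeyed are names I made up, and it reuses the imports and mydf from above):

def pct_func(group):
    # pure aggregation: one Series of percentiles per group
    return pd.Series({'pct1': np.percentile(group['v3'], 20),
                      'pct2': np.percentile(group['v3'], 80)})

def applyParallelKeyed(dfGrouped, func):
    names, groups = [], []
    for name, group in dfGrouped:
        names.append(name)    # keep the (v1, v2) key that p.map would otherwise drop
        groups.append(group)
    with Pool(cpu_count()) as p:    # mapped function must be defined at module level
        results = p.map(func, groups)
    out = pd.DataFrame(results)     # one row per group
    out.index = pd.MultiIndex.from_tuples(names, names=['v1', 'v2'])
    return out

applyParallelKeyed(mydf.groupby(['v1', 'v2']), pct_func)

If that's roughly the right approach, the result should be a frame indexed by (v1, v2) with pct1 and pct2 columns, and no duplicate rows to clean up afterwards.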