Here is what my dataset looks like:
Name | Country
---------------
Alex | USA
Tony | DEU
Alex | GBR
Alex | USA
I am trying to produce output like this, essentially grouping by name and counting the countries:
Name | Country
---------------
Alex | {USA:2,GBR:1}
Tony | {DEU:1}
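In plain Python, the transformation I am after is essentially this (toy data hard-coded to match the table above):

from collections import Counter

rows = [('Alex', 'USA'), ('Tony', 'DEU'), ('Alex', 'GBR'), ('Alex', 'USA')]

counts = {}
for name, country in rows:
    # One Counter of countries per name
    counts.setdefault(name, Counter())[country] += 1

print(counts)  # {'Alex': Counter({'USA': 2, 'GBR': 1}), 'Tony': Counter({'DEU': 1})}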
Here is my code. It works on smaller DataFrames, but takes forever on bigger ones (mine is around 14 million rows). I also use the multiprocessing module to speed things up, but it doesn't help much:
import pandas as pd
from collections import Counter

def countNames(x):
    # Count how often each country appears within one group
    return dict(Counter(x))

def aggregate(df_full, nameList):
    df_list = []
    for q in nameList:
        # Slice the full frame down to one name, then count its countries
        df = df_full[df_full['Name'] == q]
        df_list.append(df.groupby('Name')['Country']
                         .apply(lambda x: str(countNames(x)))
                         .to_frame().reset_index())
    return pd.concat(df_list)

df = pd.DataFrame({'Name': ['Alex', 'Tony', 'Alex', 'Alex'],
                   'Country': ['USA', 'DEU', 'GBR', 'USA']})[['Name', 'Country']]
aggregate(df, df.Name.unique())
Is there anything that can speed up the internal logic (other than multiprocessing)?
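For reference, a single groupby over the whole frame, without the per-name loop, is the kind of rewrite I have in mind; this is only a sketch (the aggregate_single_pass name is mine) and I have not timed it at 14 million rows:

import pandas as pd
from collections import Counter

def aggregate_single_pass(df_full):
    # One pass over the frame: no per-name boolean filtering,
    # and one Counter per group instead of one groupby per name
    return (df_full.groupby('Name')['Country']
                   .agg(lambda s: str(dict(Counter(s))))
                   .reset_index())

On the sample frame above this returns the same two rows as aggregate, just computed in a single groupby.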