I am trying to find the most frequently occurring value in each column while aggregating with pandas. To find the most frequent value I am using value_counts,
as suggested here, but I am running into a performance issue (see the snippet below):
import random
import time

import pandas as pd

df = pd.DataFrame({
    'Country_ID': [random.randint(1000, 100001) for i in range(100000)],
    'City': [random.choice(['NY', 'Paris', 'London', 'Delhi']) for i in range(100000)],
})

# For each Country_ID, take the most frequent City
# (value_counts sorts by count in descending order).
agg_col = {'City': lambda x: x.value_counts().index[0]}

start = time.time()
df_agg = df.groupby('Country_ID').agg(agg_col)
print("Time Taken: {0}".format(time.time() - start))
print("Data: ", df_agg.head(5))
Result:
Time Taken: 24.467301845550537
Data:
City
Country_ID
1000 London
1001 Paris
1003 London
1004 London
1006 London
Is there any way to improve the performance of the code above?
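One direction that may avoid calling a Python lambda once per group is to count each (Country_ID, City) pair in a single pass with groupby(...).size() and then keep the top row per Country_ID. The sketch below is only an illustration under assumptions: the names counts, n and df_fast are made up for this example, the fixed seed is added purely for reproducibility, and no timing is claimed. Ties between equally frequent cities may also resolve differently than with value_counts.

```python
import random

import pandas as pd

random.seed(0)  # assumption: fixed seed, only so the sketch is reproducible

df = pd.DataFrame({
    'Country_ID': [random.randint(1000, 100001) for _ in range(100000)],
    'City': [random.choice(['NY', 'Paris', 'London', 'Delhi']) for _ in range(100000)],
})

# Count every (Country_ID, City) pair in one vectorized pass.
# 'n' is a hypothetical column name for the pair count.
counts = df.groupby(['Country_ID', 'City']).size().reset_index(name='n')

# Put the most frequent City first within each Country_ID ...
counts = counts.sort_values(['Country_ID', 'n'], ascending=[True, False])

# ... then keep only that first row per Country_ID.
df_fast = counts.drop_duplicates('Country_ID').set_index('Country_ID')[['City']]

print(df_fast.head(5))
```

The result has the same shape as df_agg (one City per Country_ID), so it could be compared against the original aggregation on a smaller sample before relying on it.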