What is the most efficient way to find a frequent item in pandas group by aggregation

Question

I am trying to find most frequently occurred value in each column while aggregating by pandas. To find most frequent value i am using value_counts as suggested here, but facing performance issue(refer bellow snippet code)

import random
import time

import pandas as pd
df = pd.DataFrame({'Country_ID': [random.randint(1000, 100001) for i in
                                   range(100000)],
                  'City': [random.choice(['NY', 'Paris', 'London',
                                   'Delhi']) for i in range(100000)]})
agg_col = {'City': lambda x: x.value_counts().index[0]}
start = time.time()
df_agg = df.groupby('Country_ID').agg(agg_col)
print("Time Taken: {0}".format(time.time() - start))
print("Data: ", df_agg.head(5))

result:

Time Taken: 24.467301845550537
Data: 
              City
Country_ID        
1000        London
1001         Paris
1003        London
1004        London
1006        London

Is there any way I could improve above performance?

Tried with scipy.stats `agg_col2 = {'City': lambda x: scipy.stats.mode(x)[0][0]}`? — Zero, Jul 15 '18 at 13:24

score 3 · Answer 1 · answered Jul 15 '18 at 14:04

Certain operations in pandas are much slower than they need to be (idxmax on a groupby can be painful even when max itself is fast, for example). Sometimes falling back to ops which are less efficient in theory (like sorting when all we need is the maximum) but which travel along optimized paths can be helpful. [Okay, here we could actually use transform(max) and then a filter.]

def orig(df):
    agg_col = {'City': lambda x: x.value_counts().index[0]}
    df_agg = df.groupby('Country_ID').agg(agg_col)
    return df_agg.reset_index()

def via_sort(df):
    size = df.groupby(["Country_ID", "City"]).size().reset_index()
    size = size.sort_values(["City", 0])  # sort City to break ties
    df_agg = size.groupby("Country_ID")["City"].last()
    return df_agg.reset_index()

which gives me

In [33]: %time orig_out = orig(df.iloc[:10000])
Wall time: 4.87 s

In [34]: %time sort_out = via_sort(df.iloc[:10000])
Wall time: 31.2 ms

I was too impatient to wait for the full 100000 to finish for the original code, but:

In [39]: %time sort_out = via_sort(df)
Wall time: 93.6 ms

Now it should be noted that the two don't give exactly the same results-- they differ in how ties are handled. For example:

In [48]: orig_out.loc[(orig_out != sort_out).any(axis=1)].head(1)
Out[48]: 
   Country_ID    City
9        1093  London

In [49]: sort_out.loc[(orig_out != sort_out).any(axis=1)].head(1)
Out[49]: 
   Country_ID   City
9        1093  Paris

In [50]: df.query('Country_ID == 1093')
Out[50]: 
       Country_ID    City
1758         1093  London
7378         1093   Paris
29188        1093   Delhi

but you can customize this as you like.

ayorgo · Answer 2 · 2018-07-15T20:14:16.767

The following gives pretty much instantaneous result (around 0.1s on my machine):

Get counts series with the multiindex ('Country_ID', 'City')

df_agg = df.groupby('Country_ID')['City'].value_counts()

Country_ID  City  
1000        London    6
            Delhi     4
            Paris     3
            NY        2
1001        NY        6
            Delhi     4
            Paris     4
            London    3
1002        Delhi     2
            Paris     2
            London    1
            NY        1

Move part of the multiindex to a column

df_agg = df_agg.reset_index(level='City', name='Counts')

              City  Counts
Country_ID                
1000        London       6
1000         Delhi       4
1000         Paris       3
1000            NY       2
1001            NY       6
1001         Delhi       4
1001         Paris       4
1001        London       3
1002         Delhi       2
1002         Paris       2
1002        London       1
1002            NY       1

Since value_counts() returns sorted result, we just need to remove duplicates keeping the first row per index

df_agg = df_agg[~df_agg.index.duplicated(keep='first')]

              City  Counts
Country_ID                
1000        London       6
1001            NY       6
1002         Delhi       2

Now simply drop the counts

df_agg = df_agg[['City']]

              City
Country_ID        
1000        London
1001            NY
1002         Delhi

What is the most efficient way to find a frequent item in pandas group by aggregation

2 Answers2