Group by a column to find the most frequent value in another column?

Question

I want to group by one column and find the most frequent value in another column. Example:

import pandas as pd
d = {'col1': ['green','green','green','blue','blue','blue'],'col2': ['gx','gx','ow','nb','nb','mj']}
df = pd.DataFrame(data=d)
df

gives:

col1   col2
green  gx
green  gx
green  ow
blue   nb
blue   nb
blue   mj

Expected result:

green should map to gx and blue should map to nb.

I may not have been clear enough, but I don't want this. I want to keep only the rows that have the most frequent value. - user10288621, Aug 29 '18 at 08:40

jezrael · Accepted Answer · 2018-08-29T09:06:49.277

Call value_counts on each group and select the first value of its index (counts are sorted in descending order, so this is the most frequent value):

df = df.groupby('col1')['col2'].apply(lambda x: x.value_counts().index[0]).reset_index()
print (df)
    col1 col2
0   blue   nb
1  green   gx

Or use SeriesGroupBy.value_counts and keep only the first row per group with DataFrame.drop_duplicates:

df = df.groupby('col1')['col2'].value_counts().reset_index(name='v')

df = df.drop_duplicates('col1')[['col1','col2']]
print (df)
    col1 col2
0   blue   nb
2  green   gx

Or use Series.mode and select the first value by position with Series.iat:

df = df.groupby('col1')['col2'].apply(lambda x: x.mode().iat[0]).reset_index()
print (df)
    col1 col2
0   blue   nb
1  green   gx
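
Each snippet above overwrites df, so they cannot be run one after another as written. A minimal self-contained sketch that leaves the original frame untouched, using the question's data (the result names by_counts, by_dedup and by_mode are mine, chosen only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['green', 'green', 'green', 'blue', 'blue', 'blue'],
    'col2': ['gx', 'gx', 'ow', 'nb', 'nb', 'mj'],
})

# 1) value_counts per group, first index entry is the most frequent value
by_counts = (df.groupby('col1')['col2']
               .apply(lambda x: x.value_counts().index[0])
               .reset_index())

# 2) grouped value_counts, then keep the first (highest-count) row per group
by_dedup = (df.groupby('col1')['col2']
              .value_counts()
              .reset_index(name='v')
              .drop_duplicates('col1')[['col1', 'col2']])

# 3) mode per group, first value by position
by_mode = (df.groupby('col1')['col2']
             .apply(lambda x: x.mode().iat[0])
             .reset_index())

print(by_counts)
print(by_dedup)
print(by_mode)
```

All three should agree on this data. With ties inside a group they may pick different values, so which one is returned should not be relied on in that case.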

EDIT:

The problem occurs with groups that contain only NaNs:

import numpy as np

d = {'col1': ['green','green','green','blue','blue','blue'],
     'col2': [np.nan,np.nan,np.nan,'nb','nb','mj']}
df = pd.DataFrame(data=d)

f = lambda x: np.nan if x.isnull().all() else x.value_counts().index[0]
#or
#f = lambda x: next(iter(x.value_counts().index), np.nan)
#another solution
#f = lambda x: next(iter(x.mode()), np.nan)
df = df.groupby('col1')['col2'].apply(f).reset_index()
print (df)
    col1 col2
0   blue   nb
1  green  NaN
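
To see why an all-NaN group fails, here is a small sketch on a single Series. It relies on value_counts dropping NaN by default (dropna=True), which leaves an empty index; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# One group's values when every entry is missing
s = pd.Series([np.nan, np.nan, np.nan], dtype=object)

counts = s.value_counts()   # NaN is excluded by default, so nothing is counted
print(len(counts))          # 0

try:
    counts.index[0]         # same access as in the first solution
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__
print(error_name)

# Fallback used in the EDIT: take the first index entry, or NaN if there is none
safe = next(iter(counts.index), np.nan)
print(safe)
```

The next(iter(...), np.nan) form avoids the explicit isnull().all() check because it falls back to NaN whenever the counts are empty.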

I tried to apply it to a DataFrame other than the example one and it raises `IndexError: index 0 is out of bounds for axis 0 with size 0`. Do you know why? - user10288621, Aug 29 '18 at 08:55

jpp · Answer 2 · 2018-08-29T08:58:38.477


You can use GroupBy + transform with pd.Series.mode and then drop_duplicates.

With this solution, the index from your original dataframe is maintained. It assumes there is only one mode, and so filters for one mode per group.

modes = df.groupby('col1')['col2'].transform(lambda x: x.mode().iat[0])
res = df[df['col2'] == modes].drop_duplicates()

print(res)

    col1 col2
0  green   gx
3   blue   nb
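
If the goal is instead to keep every row whose col2 equals the group's most frequent value, as the asker's comment suggests, one possible variation is to skip drop_duplicates. This is a sketch on the question's data, not a tested general solution; the name kept is mine, and like the code above it assumes no group is entirely NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['green', 'green', 'green', 'blue', 'blue', 'blue'],
    'col2': ['gx', 'gx', 'ow', 'nb', 'nb', 'mj'],
})

# Broadcast each group's (first) mode back onto the original rows
modes = df.groupby('col1')['col2'].transform(lambda x: x.mode().iat[0])

# Keep all rows that match their group's mode; the original index is preserved
kept = df[df['col2'] == modes]
print(kept)
```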

edited Aug 29 '18 at 08:58

answered Aug 29 '18 at 08:44
