1

Have a dataframe with several columns from which I want to extract one row for each "family" of individuals that has the most frequent number ("No"). I have tested this with a for -loop that seems to work, but being a newbe I wanted to know if there is a shorter/smarter way of doing it.

Here is a short example code:

import pandas as pd


ind = [ ('A', 'a', 0.1 , 9) ,
             ('B', 'b', 0.6  , 10) ,
             ('C', 'b', 0.4 , 10) ,
             ('D', 'b', 0.2, 7) ,
             ('E', 'a', 0.9  , 6) ,
             ('F', 'b', 0.7 , 11)
              ]


df = pd.DataFrame(ind, columns = ['Name' , 'Family', 'Prob', 'No'])

res = pd.DataFrame(columns = df.columns)

for name,g in df.groupby('Family'):
    v = g['No'].value_counts().idxmax()
    idx = g['No'] == v
    si = g[idx].iloc[0]
    res = res.append(si)
print(res)

I have looked at several exampels that do some of it like this but with that I can only get the "Family" and "No" and not the whole row...

Erik Thysell
  • 1,160
  • 1
  • 14
  • 31

2 Answers2

2

Here is an alternative using duplicated and mode+groupby with mode:

c = df['No'].eq(df.groupby('Family')['No'].transform(lambda x: x.mode().iat[0]))
c1 = df[['Family','No']].duplicated()
output = df[c & ~c1]

  Name Family  Prob  No
1    B      b   0.6  10
4    E      a   0.9   6
anky
  • 74,114
  • 11
  • 41
  • 70
2

Use GroupBy.transform with first mode, then filter and last remove duplicates by DataFrame.drop_duplicates:

df1 = (df[df.groupby('Family')['No'].transform(lambda x: x.mode().iat[0]).eq(df['No'])]
         .drop_duplicates(['Family','No']))
print (df1)
  Name Family  Prob  No
1    B      b   0.6  10
4    E      a   0.9   6
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252