1

I have the following dataframe. I want to group by a first. Within each group, I need to do a value count based on c and only pick the one with most counts if the value in c is not EMP. If the value in c is EMP, then I want to pick the one with the second most counts. If there is no other value than EMP, then it should be EMP as in the case where a = 4.

a        c
1        EMP
1        y
1        y
1        z
2        z
2        z
2        EMP
2        z
2        a
2        a
3        EMP
3        EMP
3        k
4        EMP
4        EMP
4        EMP

The expected result would be

a        c
1        y
2        z
3        k
4        EMP

This is what I have tried so far: I have managed to sort it in the order I need so I could take the first row. However, I cannot figure out how to implement the condition for EMP using a lambda function with the drop_duplicates function as there is only the keep=first or keep=last option.

df = df.iloc[df.groupby(['a', 'c']).c.transform('size').mul(-1).argsort(kind='mergesort')]

Edit:

The mode solution worked, I have an additional question. My dataframe contains about 50 more columns, and I would like to have all of these columns in the end result as well, with the values corresponding to the rows picked using the mode operation and the first value that occurred for the EMP rows. How would I do that? Is there an easier solution than what is mentioned here where you create a dict of functions and pass it to agg? Using SQL for this is also fine.

user42
  • 871
  • 1
  • 10
  • 28

3 Answers3

1

One option could be to remove the EMP, get the mode per group, then reindex on unique a filling with 'EMP':

out = (df[df['c'].ne('EMP')]
       .groupby('a', sort=False)['c']
       .apply(lambda g: g.mode()[0])
       .reindex(df['a'].unique(), fill_value='EMP')
       .reset_index()
      )

Similar approach using a custom function:

def cust_mode(s):
    counts = s.value_counts(sort=False)
    if 'EMP' in counts.index:  # make the EMP count -1
        counts['EMP'] = -1
    return counts.idxmax()

out = df.groupby('a', as_index=False)['c'].agg(cust_mode)

Another option would be to perform a custom sort by value_counts and moving the EMP to the top (using numpy.lexsort), then getting the last value per group (groupby.last):

import numpy as np

count = df.merge(df.value_counts().reset_index(name='count'))['count']
out = (df.iloc[np.lexsort([count, df['c'].ne('EMP')])]
         .groupby('a', as_index=False).last()
       )

Output:

   a    c
0  1    y
1  2    z
2  3    k
3  4  EMP
mozway
  • 194,879
  • 13
  • 39
  • 75
1

This should work as well:

(df.value_counts()
 .reset_index(name = 'count')
 .sort_values('c',key = lambda x: x.eq('EMP'))
 .groupby('a')[['a','c']].head(1)
 .sort_values('a'))

Output:

   a    c
2  1    y
0  2    z
8  3    k
1  4  EMP
rhug123
  • 7,893
  • 1
  • 9
  • 24
  • 1
    This is more or less equivalent to my `lexsort` approach, sorting first by count (implicitly done by default in `value_counts`), then by EMP status ;) – mozway Apr 13 '23 at 15:02
1

Another possible solution:

out = df.value_counts(['a', 'c']).reset_index()
pd.concat([
    out.loc[~out.duplicated(['a'], keep=False) & out['c'].eq('EMP')],
    out.loc[out['c'].ne('EMP')].sort_values('count', ascending=False)
    .drop_duplicates('a')])[['a', 'c']]

Output:

   a    c
1  4  EMP
0  2    z
2  1    y
8  3    k
PaulS
  • 21,159
  • 2
  • 9
  • 26