GroupBy aggregate count based on specific column

Question

I've been looking for a few hours and can't seem to find a topic related to that exact matter.

So basically, I want to compute something other than the mean on a groupby. My groupby returns two columns, 'feature_name' and 'target_name', and I want to replace the value in 'target_name' with something else: the number of occurrences of 1, of 0, the difference between the two, etc.

print(df[[feature_name, target_name]])

When I print my dataframe with the columns I use, I get the following: screenshot

I already have the following code to compute the mean of 'target_name' for each value of 'feature_name':

df[[feature_name, target_name]].groupby([feature_name],as_index=False).mean()

Which returns : this.

I want to compute things other than the mean. Here are the values I want to compute in the end: what I want

In my case, the feature 'target_name' will always be equal to either 1 or 0 (with 1 being 'good' and 0 'bad').

I have seen this example in another answer:

df.groupby(['catA', 'catB'])['scores'].apply(lambda x: x[x.str.contains('RET')].count())

But I don't know how to apply this to my case as x would be simply an int. And after solving this issue, I still need to compute more than just the count!

Thanks for reading ☺

Dillon · Accepted Answer · 2018-06-18T10:17:16.413

import pandas as pd

def my_func(x):
    # Create your 3 metrics here
    calc1 = x.min()
    calc2 = x.max()
    calc3 = x.sum()

    # Return a pandas Series so that each metric gets its own label
    return pd.Series(dict(metric1=calc1, metric2=calc2, metric3=calc3))


# Apply the function you created
df.groupby(...)['columns needed to calculate formulas'].apply(my_func).unstack()

Optionally, calling .unstack() at the end turns your 3 metrics into column headers. Without it, the result is a Series indexed by (group, metric).

As an example:

df
Out[]:
   Names         A         B
0     In  0.820747  0.370199
1    Out  0.162521  0.921443
2     In  0.534743  0.240836
3    Out  0.910891  0.096016
4     In  0.825876  0.833074
5    Out  0.546043  0.551751
6     In  0.305500  0.091768
7    Out  0.131028  0.043438
8     In  0.656116  0.562967
9    Out  0.351492  0.688008
10    In  0.410132  0.443524
11   Out  0.216372  0.057402
12    In  0.406622  0.754607
13   Out  0.272031  0.721558
14    In  0.162517  0.408080
15   Out  0.006613  0.616339
16    In  0.313313  0.808897
17   Out  0.545608  0.445589
18    In  0.353636  0.465455
19   Out  0.737072  0.306329

df.groupby('Names')['A'].apply(my_func).unstack()
Out[]:
        metric1   metric2   metric3
Names                              
In     0.162517  0.825876  4.789202
Out    0.006613  0.910891  3.879669
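
For the 0/1 target in the question, the same pattern could look roughly like the sketch below. The column names "feature" and "target", the toy data, and the function name count_metrics are all assumptions made up for illustration; swap in your own feature_name and target_name.

```python
import pandas as pd

# Toy data (hypothetical): "feature" and "target" stand in for
# feature_name and target_name from the question.
df = pd.DataFrame({
    "feature": ["a", "a", "a", "b", "b"],
    "target":  [1, 0, 1, 0, 0],
})

def count_metrics(x):
    # x is the Series of 0/1 target values for one group
    n_good = int((x == 1).sum())   # number of 1s
    n_bad = int((x == 0).sum())    # number of 0s
    # Return a Series so each metric gets its own label
    return pd.Series(dict(n_good=n_good, n_bad=n_bad, diff=n_good - n_bad))

# One row per feature value, one column per metric
result = df.groupby("feature")["target"].apply(count_metrics).unstack()
print(result)
```

Comparing with == inside the function sidesteps the problem of x "being an int": x is the whole group's Series, so (x == 1).sum() counts the matching rows.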

Thank you! Works perfectly. – Thomas Coquereau, Jun 18 '18 at 12:23
