groupby - selecting only certain groups

Question

I have the below DataFrame, I want to select the Services where there are less than 2 'Healthy' instances of that service. In this case I want the Series (EmailService, UserService, NotificationService)

              CPU              Service  Memory   Status
IP                                                     
10.22.11.150   13       StorageService      55  Healthy
10.22.11.90    23       StorageService      19  Healthy
10.22.11.91    10         EmailService      44  Healthy
10.22.11.92    69          UserService       1  Healthy
10.22.11.93    63  NotificationService      81  Healthy
10.22.11.93    87  NotificationService      98  Unhealthy

I think I need this grouping,

grouped = servers_df.groupby('Service')

but not sure how to count the Status column and then get the result based on that.

jezrael · Accepted Answer · 2018-02-05T21:40:26.327

Use transform with lambda function for count Healthy and compare, last filter by boolean indexing:

df = df[df.groupby('Service')['Status'].transform(lambda x: (x=='Healthy').sum() < 2)]
print (df)
             CPU              Service  Memory     Status
IP                                                      
10.22.11.91   10         EmailService      44    Healthy
10.22.11.92   69          UserService       1    Healthy
10.22.11.93   63  NotificationService      81    Healthy
10.22.11.93   87  NotificationService      98  Unhealthy

If want check only 1 value Healthy per group use duplicated with keep=False for all dupes and chain it with condition for compare Healthy for filter out multiple Unhealthy, then invert condition by ~ and filter boolean indexing again:

df = df[~(df.duplicated(['Service','Status'], keep=False) & (df['Status'] == 'Healthy'))]

Tai · Answer 2 · 2018-02-05T21:52:30.980

You can also use filter.

df.groupby("Service").filter(lambda x: len(x[x.Status == "Healthy"]) < 2)

Might be slower in speed according to jezrael's experiment in this answer

Another way: using apply (modified from jezrael's transform solution)

df.groupby('Service').apply(
                   lambda x: x if (x.Status == 'Healthy').sum() < 2 else None)


                        IP         CPU  Service              Memory Status
Service                     
EmailService        2   10.22.11.91 10  EmailService         44 Healthy
NotificationService 4   10.22.11.93 63  NotificationService  81 Healthy
                    5   10.22.11.93 87  NotificationService  98 Unhealthy
UserService         3   10.22.11.92 69  UserService          1  Healthy

Unfortunately `filter` is really slow :( – jezrael Feb 05 '18 at 21:29 — jezrael, Feb 05 '18 at 21:29

score 1 · Answer 3 · edited Feb 05 '18 at 21:53

1

IIUC

s=df[df.Status=='Healthy'].groupby('Service').Service.count().lt(2)
df.loc[df.Service.isin(s[s].index)]

    IP          CPU Service             Memory  Status
2   10.22.11.91 10  EmailService        44      Healthy
3   10.22.11.92 69  UserService         1       Healthy
4   10.22.11.93 63  NotificationService 81      Healthy
5   10.22.11.93 87  NotificationService 98      Unhealthy

edited Feb 05 '18 at 21:53

Tai

7,684
3
29
49

answered Feb 05 '18 at 21:33

BENY

317,841
20
164
234

1

@Tai yep let me fix – BENY Feb 05 '18 at 21:39

groupby - selecting only certain groups

3 Answers3