drop unused categories using groupby on categorical variable in pandas

Question

As per Categorical Data - Operations, by default groupby will show “unused” categories:

In [118]: cats = pd.Categorical(["a","b","b","b","c","c","c"], categories=["a","b","c","d"])

In [119]: df = pd.DataFrame({"cats":cats,"values":[1,2,2,2,3,4,5]})

In [120]: df.groupby("cats").mean()
Out[120]: 
      values
cats        
a        1.0
b        2.0
c        4.0
d        NaN

How to obtain the result with the “unused” categories dropped? e.g.

  values
cats        
a        1.0
b        2.0
c        4.0

If you search it https://stackoverflow.com/questions/40534489/a-faster-way-of-removing-unused-categories-in-pandas — BENY, Jan 02 '18 at 17:11
Possible duplicate of [A Faster Way of Removing Unused Categories in Pandas?](https://stackoverflow.com/questions/40534489/a-faster-way-of-removing-unused-categories-in-pandas) — tales, Jan 02 '18 at 17:12
Hmm, the same function is used there, but I have no idea what it's doing. :-) — cs95, Jan 02 '18 at 17:14

score 16 · Accepted Answer · edited Jan 17 '23 at 01:18

16

Since version 0.23 you can specify observed=True in the groupby call to achieve the desired behavior.

edited Jan 17 '23 at 01:18

wjandrea

28,235
9
60
81

answered Sep 14 '19 at 08:56

Dienow

1,317
12
13

score 12 · Answer 2 · edited Jan 17 '23 at 01:24

12

Option 1
remove_unused_categories

df.groupby(df['cats'].cat.remove_unused_categories()).mean()

      values
cats        
a          1
b          2
c          4

You can also make the assignment first, and then groupby -

df.assign(cats=df['cats'].cat.remove_unused_categories()).groupby('cats').mean()

Or,

df['cats'] = df['cats'].cat.remove_unused_categories()
df.groupby('cats').mean()

      values
cats        
a          1
b          2
c          4

Option 2
astype to str conversion -

df.groupby(df['cats'].astype(str)).mean()

      values
cats        
a          1
b          2
c          4

edited Jan 17 '23 at 01:24

wjandrea

28,235
9
60
81

answered Jan 02 '18 at 17:07

cs95

379,657
97
704
746

This answer is now outdated. See the more modern solution using `observed=True` argument to `groupby`. – Nick Crews Jan 06 '22 at 19:02

score 3 · Answer 3 · answered Jan 02 '18 at 17:08

3

Just chain with dropna. Like so:

df.groupby("cats").mean().dropna()

      values
cats
a        1.0
b        2.0
c        4.0

answered Jan 02 '18 at 17:08

Grr

15,553
7
65
85

1

I was rather looking for the`remove_unused_categories` solution, as the example is not my exact usecase and this solution does not help. – tales Jan 02 '18 at 17:10
Should be obvious, but this will also remove categories that have an actual NaN value, not just categories that aren't present. – wjandrea Jan 17 '23 at 01:28

score 3 · Answer 4 · answered Sep 06 '19 at 18:48

If you want to remove unused categories from all categorical columns, you can:

def remove_unused_categories(df: pd.DataFrame):
    for c in df.columns:
        if pd.api.types.is_categorical_dtype(df[c]):
            df[c].cat.remove_unused_categories(inplace=True)

Then before calling groupby, call:

remove_unused_categories(df_with_empty_cat)

drop unused categories using groupby on categorical variable in pandas

4 Answers4

Linked

Related