2

I have a Pandas dataframe that looks as follows:

ID  Cat
1   SF
1   W
1   F
2   R64
2   SF
2   F

The first column is an identifier and the second column contains categorical data where the order is as follows: R64 < SF < F < W

I want a new dataframe that contains for each ID the maximum categorical value. The resulting dataframe should look as follows:

ID  Cat
1   W
2   F

I tried the solution from this thread, but it does not seem to work for categorical data: df.groupby("ID", as_index=False).Cat.max()

The result with this approach looks like this:

ID  number
1   SF
2   SF

I declare the categorical column like this:

df['Cat'] = pd.Categorical(df['Cat'], categories = ["R64", "SF", "F", "W"], ordered = True)
beta
  • 5,324
  • 15
  • 57
  • 99

1 Answers1

3

It's possible with an extension. Define an ordering in your column:

df.Cat = pd.Categorical(
    df.Cat, categories=['R64', 'SF', 'F', 'W'], ordered=True)

Now, groupby works:

df.groupby('ID').Cat.max().reset_index()

   ID Cat
0   1   W
1   2   F

Alternatively, use sort_values, groupby, and head.

df.sort_values(['ID', 'Cat'], ascending=[True, False]).groupby('ID').head(1)

   ID Cat
1   1   W
5   2   F
cs95
  • 379,657
  • 97
  • 704
  • 746
  • Strangely, for me only the second approach works, even though I use `ordered = True` on the categorical column. – beta Jun 09 '18 at 21:36
  • But the question then becomes: why does `df.groupby("ID", as_index=False).Cat.max()` not work but `df.groupby("ID").Cat.max().reset_index()` work? – jpp Jun 09 '18 at 21:36
  • nope, also does not work with `reset_index()`. What works is `sort_values` + `groupby` + `head`. – beta Jun 09 '18 at 21:37
  • @beta, This could be a version issue. In v0.23, I see `as_index` fail versus `reset_index` working. – jpp Jun 09 '18 at 21:38
  • @jpp, OP, I've opened a new question on this problem: https://stackoverflow.com/questions/50778770/weird-behaviour-with-groupby-on-ordered-categorical-columns – cs95 Jun 09 '18 at 21:40
  • @coldspeed, Thanks, seems like a bug or, at a minimum, an insufficiently documented "feature". – jpp Jun 09 '18 at 21:40
  • @beta I foresee this as a bug, so I was watchful to provide you a second solution that I am confident will work. – cs95 Jun 09 '18 at 21:41
  • 1
    Sorry, my bad. When trying the solution with `reset_index()` I did not see that `as_index=False` was removed. When I remove it (and add `reset_index()`) it works. so both provided solutions work. thanks! – beta Jun 09 '18 at 22:08