5

I want to do almost the same thing as this question.

However, the approach in the accepted answer by @jezrael takes way too long based on my dataset -- I have ~300k rows in the original dataframe, and it takes a few minutes to run the nlargest(1) command. Furthermore, I tried it on a head(1000) limited dataframe, and didn't get only 1 row for each within the value_count -- I got exactly the same Series back as the value_counts.

In my own words: Basically, my dataset has two columns like this:

Session Rating
A       Positive
A       Positive
A       Positive
A       Negative
B       Negative
B       Negative
C       Positive
C       Negative

Using counts = df.groupby('Session')['Rating'].value_counts() I get a Series object like this:

Session Rating
A       Positive  3
        Negative  1
B       Negative  2
C       Positive  1
        Negative  1

How do I get a dataframe where just the Rating with the max count is included? And in cases where there are multiple maxes (such as C), I would like to exclude that one from the returned table.

manestay
  • 83
  • 1
  • 1
  • 8
  • https://stackoverflow.com/a/32192310/2027457 Checkt his way for faster (numpy here will help you better than pandas) – n1tk Jul 05 '18 at 22:38
  • @0709 I tried the accepted answer on the same dataset he generated, and it didn't give me the same results-- the shape was equal to k for some reason. – manestay Jul 05 '18 at 22:59

1 Answers1

7

I think you want something like this.

df.groupby('Session')['Rating'].apply(lambda x: x.value_counts().head(1))

Output:

Session          
A        Positive    3
B        Negative    2
C        Negative    1
Name: Rating, dtype: int64
Scott Boston
  • 147,308
  • 15
  • 139
  • 187