1

I have a dataframe with multiple attributes, some are repeating. I want to select the rows based on the max value in one column - but return the row having that value (not the max of every column). How??

Here's a sample:

df = pd.DataFrame({'Owner': ['Bob', 'Jane', 'Amy',
                            'Steve','Kelly'],
                   'Make': ['Ford', 'Ford', 'Jeep',
                           'Ford','Jeep'],
                   'Model': ['Bronco', 'Bronco', 'Wrangler',
                            'Model T','Wrangler'],
                   'Max Speed': [80, 150, 69, 45, 72],
                  'Customer Rating': [90, 50, 91, 75, 99]})

this gives us:

enter image description here

I want the row having the max(customer rating) for each Make/Model. Like this: enter image description here

Note this is NOT the same as df.groupby(['Make','Model']).max()

--> How do I do this?

Alex P
  • 45
  • 5
  • . [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/q/285551/3404097) [Why are images of text, code and mathematical expressions discouraged?](https://meta.stackexchange.com/q/320052/266284) – philipxy Aug 03 '22 at 23:15

2 Answers2

1

A variation of your answer using idxmax:

>>> df.loc[df.groupby(['Make', 'Model'])['Customer Rating'].idxmax()]
   Owner  Make     Model  Max Speed  Customer Rating
0    Bob  Ford    Bronco         80               90
3  Steve  Ford   Model T         45               75
4  Kelly  Jeep  Wrangler         72               99

Another solution without groupby:

>>> df.sort_values('Customer Rating') \
      .drop_duplicates(['Make', 'Model'], keep='last') \
      .sort_index()

   Owner  Make     Model  Max Speed  Customer Rating
0    Bob  Ford    Bronco         80               90
3  Steve  Ford   Model T         45               75
4  Kelly  Jeep  Wrangler         72               99
Corralien
  • 109,409
  • 8
  • 28
  • 52
0

I found an answer! I'm leaving the question up in case anyone else didn't recognize it as well.

Check out this post: Select the max row per group - pandas performance issue

I couldn't tell from that post that it was in fact what I needed, but it is. I tried two of them successfully:

def using_rank(df):
mask = (df.groupby(['Make', 'Model'])['Customer Rating'].rank(method='first', ascending=False) == 1)
return df.loc[mask]
df2 = using_rank(df)
df2

returns:

enter image description here

this also worked fine:

def using_sort(df):
df = df.sort_values(by=['Customer Rating'], ascending=False, kind='mergesort')
return df.groupby(['Make', 'Model'], as_index=False).first()
Alex P
  • 45
  • 5