1

If I have a panda df with following entries:

type    location    unit     condition
Car        01        19          2
Car        01        19          3
Car        01        19          1
Car        02        19          1
Car        01        20          1
Machine    05        09          1
Machine    05        09          2
Machine    05        09          3
Machine    15        09          1
Machine    15        10          1
Truck      02        09          2
Truck      02        09          1

For the duplicates (based on type | location | unit) I would like to only get the ones with the best condition. I tried several sort_value duplicates etc. but I think I have a logic issue in my head.

so the perfect result would look like this:

type    location    unit     condition
Car        01        19          3
Car        02        19          1
Car        01        20          1
Machine    05        09          3
Machine    15        09          1
Machine    15        10          1
Truck      02        09          2

Commentary on proposed duplicate

Sorry for the mistake: I totally forgot (type, location, unit and condition) are not the only columns there. There is at least two more currently (Torque and Wheels) which I would need also to have access to.

So I guess it is not possible to solve with the answers to this question.

Current full df

type    location    unit     condition      wheels      torque
Car        01        19          2            4          256
Car        01        19          3            4          320
Car        01        19          1            4          190
Car        02        19          1            4          280
Car        01        20          1            4          400
Machine    05        09          1            4          320
Machine    05        09          2            6          690
Machine    05        09          3            12        1180
Machine    15        09          1            4          290
Machine    15        10          1            6          445
Truck      02        09          2            6          625
Truck      02        09          1            8          804
halfer
  • 19,824
  • 17
  • 99
  • 186
YvesNr1
  • 43
  • 6
  • @jezrael: this post has been edited. I have no view on it, but would you take a look to see if the proposed duplicate does not fit, as per the OP's view? – halfer Apr 03 '20 at 09:58
  • @halfer - Good idea. I still guess dupe is correct, but rather ask by comment – jezrael Apr 03 '20 at 10:00
  • 1
    I think you need `df.loc[df.groupby(['type', 'location', 'unit'])['condition'].idxmax()]`, if not working can you add expected output? – jezrael Apr 03 '20 at 10:01

1 Answers1

1

You can use groupby and retain just the maximum condition:

df = df.groupby(['type', 'location', 'unit'])['condition'].max().reset_index()
Bruno Mello
  • 4,448
  • 1
  • 9
  • 39
  • Okay, that seams to filter out what I wanted. But little issue is that I have two more columns which are not in the df I posted above which I lose when I do the groupby. Or am I missing something? – YvesNr1 Apr 02 '20 at 11:42
  • What are those columns? You have to define how you want to aggregate them – Bruno Mello Apr 02 '20 at 11:42
  • there are two more colums called "wheels" and "torque" – YvesNr1 Apr 02 '20 at 11:53
  • They are unique by each ['type', 'location', 'unit']? – Bruno Mello Apr 02 '20 at 11:54
  • Well they are all over the place no duplicates at all at least for torque. So they should not be grouped at all. Its more like I just should still be able to access them so I can use them in my for loop. Right now I can not access them anymore – YvesNr1 Apr 02 '20 at 11:56
  • So you can't get the condition max because it returns a smaller dataframe and thus you will lose your information about the torque and well. You could get something like max_torque and max_wheel if that is what you want. – Bruno Mello Apr 02 '20 at 11:59