0

I am having an issue with a dateframe I have created. It has multiple columns along with the 2 columsn im trying to group by and its a date time.

the table is as follows-

product  number  color solddate    price
TV       123     green 20/04/2020  50
TV       123     green 19/04/2020  100

Im trying to return just the row with the highest price. Regardless of solddate. but I still need to return the solddate.

product  number  color solddate    price
TV       123     green 19/04/2020  100

This is on a dataframe which contains approximately 70k rows.

I was trying with :

price = new_df['price']
c_maxes = new_df.groupby(['product', 'number','color' ]).price.transform(max)
new__df2 = c_maxes.loc[c_maxes == new_df.price]

print(new__df2)

but its not working, if I output to excel im still able to use that dedup function and remove around 600 rows.

Thanks

Lee Cable
  • 3
  • 2

2 Answers2

0

I think this post might be relevant.

Also, this method might be useful ( came across this link ):

df.loc[df.groupby(['product', 'number', 'color'])['price'].idxmax()]
Sajan
  • 1,247
  • 1
  • 5
  • 13
0

you can try something like this :

In [175]: df = pd.DataFrame([{'product':'TV','number':123,'color':'green','solddate':'20/04/2020','price':50},{'product':'TV','color':'green','number':123,'solddate':'19/04/2020', 'price':100}]
     ...: )

In [176]: df
Out[176]:
   color  number  price product    solddate
0  green     123     50      TV  20/04/2020
1  green     123    100      TV  19/04/2020

In [177]: c_maxes = df.groupby(['product','color','number'])['price'].transform(max) == df.price

In [178]: c_maxes
Out[178]:
0    False
1     True
Name: price, dtype: bool

In [179]: new_df2 = df[c_maxes]

In [180]: new_df2
Out[180]:
   color  number  price product    solddate
1  green     123    100      TV  19/04/2020
Rajat Mishra
  • 3,635
  • 4
  • 27
  • 41