24

I have a DataFrame with the following data:

invoice_no  dealer  billing_change_previous_month        date
       110       1                              0  2016-12-31
       100       1                         -41981  2017-01-30
      5505       2                              0  2017-01-30
      5635       2                          58730  2016-12-31

I want to keep only one row per dealer, the one with the maximum date. The desired output should look like this:

invoice_no  dealer  billing_change_previous_month        date
       100       1                         -41981  2017-01-30
      5505       2                              0  2017-01-30

Each dealer should appear exactly once, with its maximum date. Thanks in advance for your help.

3novak
Anurag Rawat

3 Answers

30

You can use boolean indexing with groupby and transform:

df_new = df[df.groupby('dealer').date.transform('max') == df['date']]

    invoice_no  dealer  billing_change_previous_month   date
1   100         1       -41981                          2017-01-30
2   5505        2       0                               2017-01-30

The solution works as expected even if there are more than two dealers (to address the question posted by Ben Smith):

import pandas as pd

df = pd.DataFrame({
    'invoice_no': [110, 100, 5505, 5635, 10000, 10001],
    'dealer': [1, 1, 2, 2, 3, 3],
    'billing_change_previous_month': [0, -41981, 0, 58730, 9000, 100],
    'date': ['2016-12-31', '2017-01-30', '2017-01-30', '2016-12-31', '2019-12-31', '2020-01-31'],
})

# Parse the date strings so that 'max' compares real timestamps
df['date'] = pd.to_datetime(df['date'])
df[df.groupby('dealer').date.transform('max') == df['date']]


    invoice_no  dealer  billing_change_previous_month   date
1   100         1       -41981                          2017-01-30
2   5505        2       0                               2017-01-30
5   10001       3       100                             2020-01-31
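
To see why the mask is computed per dealer and not over the whole frame, it may help to split the one-liner into steps. This is only a sketch on the question's sample data; the names max_per_dealer and mask are introduced here for illustration and are not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'invoice_no': [110, 100, 5505, 5635],
    'dealer': [1, 1, 2, 2],
    'billing_change_previous_month': [0, -41981, 0, 58730],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30', '2017-01-30', '2016-12-31']),
})

# Step 1: the maximum date of each dealer, repeated on every row of that dealer.
# The result has the same index and length as df.
max_per_dealer = df.groupby('dealer')['date'].transform('max')

# Step 2: True where a row's own date equals its dealer's maximum date.
mask = df['date'] == max_per_dealer

# Step 3: boolean indexing keeps only the rows flagged True.
df_new = df[mask]
print(df_new)
```

Because max_per_dealer is aligned row by row with df, each row is compared against the maximum of its own dealer only, never against the global maximum.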
Vaishali
  • 1
    Thanks @Vaishali, can you please explain what (== df['date']) does? – Anurag Rawat Feb 13 '18 at 05:36
  • 1
    Unlike an aggregating groupby, transform doesn't change the shape of the data. So df.groupby('dealer').date.transform('max') gives you a date series holding the maximum date of each row's dealer. Comparing this series with your date column returns a boolean series. Pass the boolean series to the df and you get the rows where the condition series == df['date'] is true. – Vaishali Feb 13 '18 at 16:40
  • This method only works in this particular case where there are only two distinct dealers, but consider a case where there are many dealers and many different dates. This method naively checks for the max date of the whole dataframe to make a boolean series. So when you pass the boolean series to the whole dataframe, it only checks whether this max date of all dates exists in the dataframe, which will result in a huge loss of data. I don't think this is what we want. – Ben Smith Apr 06 '20 at 15:39
10

Here https://stackoverflow.com/a/41531127/9913319 is a more correct solution:

df.sort_values('date').groupby('dealer').tail(1)
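
One practical difference from a date-equality filter shows up when a dealer has several invoices on its latest date. The sketch below uses a trimmed version of the question's data plus one hypothetical invoice (101) that ties with invoice 100. It assumes a stable sort (kind='mergesort') so that the surviving tied row is predictable; the default sort algorithm makes no such promise for ties.

```python
import pandas as pd

# Question's data (billing column omitted) plus a hypothetical invoice 101
# that shares dealer 1's latest date with invoice 100.
df = pd.DataFrame({
    'invoice_no': [110, 100, 101, 5505, 5635],
    'dealer': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# Sort by date with a stable algorithm, then take the last row of each dealer.
# Exactly one row per dealer is returned, even when the latest date is tied.
latest = df.sort_values('date', kind='mergesort').groupby('dealer').tail(1)
print(latest)

# For comparison, filtering on equality with the per-dealer maximum keeps
# every tied row, so dealer 1 appears twice here.
tied = df[df.groupby('dealer')['date'].transform('max') == df['date']]
print(tied)
```

Which behaviour is preferable depends on whether tied invoices should all be kept or collapsed to a single row.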
Rufat
1

Tack 1

Sort by dealer and by date, then use drop_duplicates on dealer, keeping the last (latest) row of each. This sidesteps the issue that surfaces in Tack 2 below, since this method cannot return multiple records for a dealer. This may or may not matter to you depending on your data and your use case.

df.sort_values(['dealer', 'date'], inplace=True)
df.drop_duplicates('dealer', keep='last', inplace=True)

Tack 2

This is a clumsier way to do it, with a groupby and a merge. Use groupby to find the max date for each dealer, then merge that result back onto the original frame. The how='inner' parameter keeps only those dealer and date combinations that appear in the groupby result, that is, the rows carrying each dealer's maximum date.

However, please note that this will return multiple records per dealer if the max date is duplicated in the original table. You might need to use drop_duplicates depending on your data and your use case.

df.merge(df.groupby('dealer')['date'].max().reset_index(), 
                             on=['dealer', 'date'], how='inner')

   invoice_no  dealer  billing_change_previous_month        date
0         100       1                         -41981  2017-01-30
1        5505       2                              0  2017-01-30
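
The duplicate-record caveat can be reproduced with a small sketch. It uses a trimmed version of the question's data plus one hypothetical invoice (101) that shares dealer 1's latest date; the names max_dates, merged and one_per_dealer are illustrative only.

```python
import pandas as pd

# Question's data (billing column omitted) plus a hypothetical invoice 101
# dated the same day as dealer 1's latest invoice.
df = pd.DataFrame({
    'invoice_no': [110, 100, 101, 5505, 5635],
    'dealer': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# Maximum date per dealer as a two-column frame: dealer, date.
max_dates = df.groupby('dealer')['date'].max().reset_index()

# The inner merge keeps every row matching a (dealer, max date) pair,
# so dealer 1 comes back twice because of the tie.
merged = df.merge(max_dates, on=['dealer', 'date'], how='inner')
print(merged)

# Collapse to one row per dealer if ties should not be kept.
one_per_dealer = merged.drop_duplicates('dealer')
print(one_per_dealer)
```

Whether to keep the first or the last tied row (the keep argument of drop_duplicates) is a choice that depends on your data.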
3novak