0

I have a program that runs completely through but unfortunately is returning duplicates because of the way the base data is structured. Results looks like this:

   Date      Amount   Source   Type
  7/16/2019  10        A       B
  7/17/2019  10        A       B
  7/15/2019  10        A       B
  7/15/2019  10        B       B

I'd like to return:
   Date      Amount   Source   Type
  7/17/2019   10        A       B
  7/15/2019   10        B       B

7/17/2019 is chosen because its the last date we received 10 from source A and Type B.

I've tried:

df.drop_duplicates(subset='a','b','date', keep="last") 

but its not quite working. Is there a better way to do this?

This worked

df[df.Date.eq(df.groupby(['Source','Type'])['Date'].transform('max'))]
user2679225
  • 159
  • 2
  • 4
  • 12

2 Answers2

1

As stated in This post:

non_duplicate_index = ~df.index.duplicated(keep='first')
result = df.loc[non_duplicate_index]

df.index.duplicated(keep='first') Returns an index containing True/False values. True if the value is duplicated, and False otherwise. Then, ~df.index.duplicated(keep='first') returns True where the value is not duplicated.

Finally df.loc[non_duplicate_index] is an slicing pandas method which returns the df rows where non_duplicate_index is True.

1

drop_duplicates will work as well

df.sort_values('Date').drop_duplicates(subset=['Source','Type'], keep="last") 
Out[566]: 
        Date  Amount Source Type
3 2019-07-15      10      B    B
1 2019-07-17      10      A    B
BENY
  • 317,841
  • 20
  • 164
  • 234