1

I want to left join Dataframe 'Raw' with another dataframe 'acknowledgement'

Raw

LogId
1233
1234
1235

acknowledgement

LogId  Reason       Time
1233   Valid        13/09/20 09:45:19
1233   Not Valid    13/09/20 17:23:20
1234   Not Valid    25/09/20 14:56:34
1234   Valid        21/09/20 12:23:12

My code:

pd.merge(raw,acknowledgement,on='Log Id',how='left')

Now when i do the left join on 'Log Id' the values are getting duplicated.

The desired output should take only the latest days remarks for that particular 'Log Id'

LogId  Reason       Time
1233   Not Valid    13/09/20 17:23:20
1234   Not Valid    25/09/20 14:56:34
1235   #N/A          #N/A
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
Harish reddy
  • 431
  • 2
  • 9
  • This is almost a dup as @NYCCoder pointed out. Drop the duplicates on `acknowledgement` then merge. – Quang Hoang Oct 28 '20 at 14:20
  • My opinion is it is duplicate, because merge solution is in question, only need `ack = acknowledgement.loc[acknowledgement.groupby('LogId')['Time'].idxmax()]`, but is possible I am wrong – jezrael Oct 28 '20 at 14:23

1 Answers1

2

Let's try groupby + idxmax to get the indices of max values of Time per LogId then use these indices to filter the dataframe then left merge it with raw :

ack = acknowledgement.loc[acknowledgement.groupby('LogId')['Time'].idxmax()]
raw.merge(ack, on='LogId', how='left')

Alternatively you can also try:

m = acknowledgement.sort_values('Time').duplicated('LogId', keep='last')
raw.merge(acknowledgement[~m], on='LogId', how='left')

   LogId     Reason                Time
0   1233  Not Valid 2020-09-13 17:23:20
1   1234  Not Valid 2020-09-25 14:56:34
2   1235        NaN                 NaT
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/223791/discussion-on-answer-by-shubham-sharma-merge-based-on-latest-date-in-pandas). – Samuel Liew Oct 29 '20 at 00:30