
I have two DataFrames with the same set of columns, and I want to do a left join to identify the rows in the first DataFrame that are not present in the second one.

import pandas as pd

mylog = {
    'date': ['2020-12-05', '2020-12-05', '2020-12-05', '2020-12-06', '2020-12-06', '2020-12-06', '2020-12-06', '2020-12-07', '2020-12-07', '2020-12-07'],
    'rider_id': ['10554', '10372', '10372', '19042', '12544', '10554', '10996', '10994', '15498', '10554'],
    'issue_id': [1, 1, 3, 2, 5, 1, 8, 12, 1, 3]
}

dblog = {
    'date': ['2020-12-05', '2020-12-05', '2020-12-05', '2020-12-06', '2020-12-06', '2020-12-06', '2020-12-06', '2020-12-07'],
    'rider_id': ['10554', '10372', '10372', '19042', '12544', '10554', '10996', '10994'],
    'issue_id': [1, 3, 3, 2, 5, 1, 8, 12]
}

All three columns taken together form a unique entry.

mylog_df = pd.DataFrame.from_dict(mylog)
dblog_df = pd.DataFrame.from_dict(dblog)

merge_df = mylog_df.merge(dblog_df, how='left', left_on=['date', 'rider_id', 'issue_id'], right_on=['date', 'rider_id', 'issue_id'])
merge_df

This returns all ten rows of mylog_df but just the three columns. I want the left join to return a DataFrame from which I can separate out the entries that are not present in dblog_df.
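pandas' merge also accepts an indicator flag that tags each output row as 'both', 'left_only', or 'right_only'; a minimal sketch along those lines, using the frames above (the names flagged and only_in_mylog are just illustrative):

# Flag where each row came from via merge's indicator parameter.
# Duplicate keys in dblog_df will duplicate matching rows, but those
# rows are dropped again by the 'left_only' filter below.
flagged = mylog_df.merge(
    dblog_df,
    on=['date', 'rider_id', 'issue_id'],
    how='left',
    indicator=True,
)
only_in_mylog = flagged[flagged['_merge'] == 'left_only'].drop(columns='_merge')
only_in_mylog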

  • Add a column of 1s to dblog_df before the merge. Then, after the merge, a row that still has a 1 was present in dblog_df; a row with NaN was not. – Ismael EL ATIFI Dec 22 '20 at 10:11
  • And since the key column names are the same in the left and right dfs, you can just use on=['date', 'rider_id', 'issue_id'] instead of repeating the columns in left_on and right_on. – Ismael EL ATIFI Dec 22 '20 at 10:14
  • That didn't work; now I am getting three rows with NaN, while actually only the last two rows were added to 'mylog'. – Rishi4Python Dec 22 '20 at 11:00
  • It is because the merge keys are not unique in dblog_df. You can see it by adding the parameter validate="one_to_one" to the merge. You can use dblog_df = dblog_df.drop_duplicates(['date', 'rider_id', 'issue_id']) to remove the duplicates in dblog_df before the merge. – Ismael EL ATIFI Dec 22 '20 at 11:33
  • See, I want the key to be all the column entries taken together. If you take date, rider_id and issue_id together, they are all unique. – Rishi4Python Dec 22 '20 at 11:58
  • No, the keys are not unique in dblog_df; you can check with validate="one_to_one". You need to drop the duplicates in dblog_df before the merge, and then my solution works. – Ismael EL ATIFI Dec 22 '20 at 12:19
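Putting the comments together (drop the duplicates from dblog_df, add a marker column of 1s, merge on all three columns, and keep the rows whose marker came back NaN), a minimal sketch with the sample frames above; the marker name in_db is just for illustration:

# Sketch of the approach suggested in the comments.
dblog_dedup = dblog_df.drop_duplicates(['date', 'rider_id', 'issue_id'])
dblog_dedup = dblog_dedup.assign(in_db=1)  # marker column of 1s
merged = mylog_df.merge(dblog_dedup, on=['date', 'rider_id', 'issue_id'], how='left')
missing = merged[merged['in_db'].isna()].drop(columns='in_db')
missing  # the rows of mylog_df that are not present in dblog_df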

0 Answers