I have two dataframes that contain transaction values ordered by date. The first one (say the master) has added columns. The second one (the update) contains the same rows as the master, but also new rows.
I would like to update the master with the rows of the update, matching on the column values that aren't the added columns. I need to keep the duplicates within each dataframe (sometimes the exact same transaction occurs several times on the same day), but remove the rows that are overlapping duplicates between the two dataframes. I would also like to preserve the transaction order.
Example:
MASTER
date notification amount added_info
0 15/10/2021 string1 30.00 XX
1 15/10/2021 string1 30.00 XX
2 15/10/2021 string2 25.00 XX
3 11/10/2021 string3 3.00 YY
4 07/10/2021 string1 10.00
5 05/10/2021 string2 12.50
6 30/09/2021 string2 12.50 XX
UPDATE
date notification amount
0 24/12/2021 string2 20.00
1 20/12/2021 string1 12.00
2 15/10/2021 string1 30.00
3 15/10/2021 string1 30.00
4 15/10/2021 string1 30.00
5 15/10/2021 string2 25.00
6 11/10/2021 string3 3.00
7 07/10/2021 string1 10.00
8 06/10/2021 string2 10.00
EXPECTED RESULT (UPDATED MASTER)
date notification amount added_info
0 24/12/2021 string2 20.00
1 20/12/2021 string1 12.00
2 15/10/2021 string1 30.00
3 15/10/2021 string1 30.00 XX
4 15/10/2021 string1 30.00 XX
5 15/10/2021 string2 25.00 XX
6 11/10/2021 string3 3.00 YY
7 07/10/2021 string1 10.00
8 06/10/2021 string2 10.00
9 05/10/2021 string2 12.50
10 30/09/2021 string2 12.50 XX
Explanation:
It added rows 0, 1, 2 and 8 of "update", inserting them at the right chronological position; it didn't use the values in "added_info" to compare; it didn't remove rows 0 and 1 of "master" (which are duplicates within master); and it did add row 2 of "update", which is a duplicate within both dataframes but has no overlapping counterpart (update has three copies of that transaction, master only two).
Solutions with .concat() followed by .drop_duplicates() will also remove the legitimate duplicates within each dataframe.
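For instance, with two toy frames that each legitimately repeat a row (hypothetical columns, just to illustrate the problem):

```python
import pandas as pd

# Toy frames: each legitimately contains the same transaction more than once.
a = pd.DataFrame({"date": ["15/10/2021"] * 2, "amount": [30.0] * 2})
b = pd.DataFrame({"date": ["15/10/2021"] * 3, "amount": [30.0] * 3})

# drop_duplicates() cannot tell legitimate repeats from overlaps:
combined = pd.concat([a, b]).drop_duplicates()
print(len(combined))  # 1, but the desired result would keep 3 copies
```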
The naive solution I can think of is a for loop over the rows of update: for each row, compare a concatenated string of its date, notification and amount values against the ones in master; every time a duplicate is found, mark the row in both update and master, break, and move on to the next row. Eventually, .concat() the non-marked rows of update to master, and sort by date.
It sounds very inefficient and complicated. I somehow expect pandas to have a built-in method to do this efficiently, but I haven't figured out the exact variation of .merge() or .join() to use.
Here are constructors for the example dataframes:
import pandas as pd
df_master = pd.DataFrame({'date': {0: '15/10/2021', 1: '15/10/2021', 2: '15/10/2021', 3: '11/10/2021', 4: '07/10/2021', 5: '05/10/2021', 6: '30/09/2021'}, 'notification': {0: 'string1', 1: 'string1', 2: 'string2', 3: 'string3', 4: 'string1', 5: 'string2', 6: 'string2'}, 'amount': {0: 30.0, 1: 30.0, 2: 25.0, 3: 3.0, 4: 10.0, 5: 12.5, 6: 12.5}, 'added_info': {0: 'XX', 1: 'XX', 2: 'XX', 3: 'YY', 4: '', 5: '', 6: 'XX'}})
df_update = pd.DataFrame({'date': {0: '24/12/2021', 1: '20/12/2021', 2: '15/10/2021', 3: '15/10/2021', 4: '15/10/2021', 5: '15/10/2021', 6: '11/10/2021', 7: '07/10/2021', 8: '06/10/2021'}, 'notification': {0: 'string2', 1: 'string1', 2: 'string1', 3: 'string1', 4: 'string1', 5: 'string2', 6: 'string3', 7: 'string1', 8: 'string2'}, 'amount': {0: 20.0, 1: 12.0, 2: 30.0, 3: 30.0, 4: 30.0, 5: 25.0, 6: 3.0, 7: 10.0, 8: 10.0}})
Thanks a lot!
EDIT:
Here is the naive approach I was talking about. It has the advantage of leaving the added columns flexible: there could be more of them, or they could have other names, and they would remain untouched. The nice thing about this is that it would let me expand the master dataframe in the future.
Let me know how bad this solution is. Thanks!
import pandas as pd

df_master = pd.DataFrame({'date': {0: '15/10/2021', 1: '15/10/2021', 2: '15/10/2021', 3: '11/10/2021', 4: '07/10/2021', 5: '05/10/2021', 6: '30/09/2021'}, 'notification': {0: 'string1', 1: 'string1', 2: 'string2', 3: 'string3', 4: 'string1', 5: 'string2', 6: 'string2'}, 'amount': {0: 30.0, 1: 30.0, 2: 25.0, 3: 3.0, 4: 10.0, 5: 12.5, 6: 12.5}, 'added_info': {0: 'XX', 1: 'XX', 2: 'XX', 3: 'YY', 4: '', 5: '', 6: 'XX'}})
df_update = pd.DataFrame({'date': {0: '24/12/2021', 1: '20/12/2021', 2: '15/10/2021', 3: '15/10/2021', 4: '15/10/2021', 5: '15/10/2021', 6: '11/10/2021', 7: '07/10/2021', 8: '06/10/2021'}, 'notification': {0: 'string2', 1: 'string1', 2: 'string1', 3: 'string1', 4: 'string1', 5: 'string2', 6: 'string3', 7: 'string1', 8: 'string2'}, 'amount': {0: 20.0, 1: 12.0, 2: 30.0, 3: 30.0, 4: 30.0, 5: 25.0, 6: 3.0, 7: 10.0, 8: 10.0}})

df_update["Dupl"] = ""
df_master_concat = pd.DataFrame()
df_update_concat = pd.DataFrame()

# Build a comparison key per row from the shared columns only.
df_update_concat["string"] = df_update[["date", "notification", "amount"]].agg(lambda x: "".join(x.astype(str)), axis=1)
df_master_concat["string"] = df_master[["date", "notification", "amount"]].agg(lambda x: "".join(x.astype(str)), axis=1)

for indx, row in df_update_concat.iterrows():
    if row["string"] in df_master_concat["string"].values:
        # Mark the update row as an overlap and consume one matching master
        # row, so within-frame duplicates are paired one-to-one.
        df_update.at[indx, "Dupl"] = "X"
        df_master_concat.drop(df_master_concat.loc[df_master_concat["string"] == row["string"]].iloc[0].name, inplace=True)  # drop the first occurrence

# Keep only the update rows that had no counterpart in master.
df_update = df_update[df_update.Dupl != "X"]
df_update = df_update.drop(columns="Dupl")

df_master = pd.concat([df_master, df_update], ignore_index=True).fillna("")
df_master["date"] = pd.to_datetime(df_master["date"], format="%d/%m/%Y")
df_master.sort_values(by="date", inplace=True, ascending=False)
df_master["date"] = df_master["date"].dt.strftime("%d/%m/%Y")
df_master = df_master.reset_index(drop=True)
print(df_master)
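For comparison, here is a merge-based sketch of the same idea that I have been wondering about (untested assumption: numbering repeats with groupby().cumcount() gives genuine duplicates distinct keys, so an outer merge pairs them one-to-one instead of collapsing them; "occ" is just a helper column name of my choosing):

```python
import pandas as pd

df_master = pd.DataFrame({'date': {0: '15/10/2021', 1: '15/10/2021', 2: '15/10/2021', 3: '11/10/2021', 4: '07/10/2021', 5: '05/10/2021', 6: '30/09/2021'}, 'notification': {0: 'string1', 1: 'string1', 2: 'string2', 3: 'string3', 4: 'string1', 5: 'string2', 6: 'string2'}, 'amount': {0: 30.0, 1: 30.0, 2: 25.0, 3: 3.0, 4: 10.0, 5: 12.5, 6: 12.5}, 'added_info': {0: 'XX', 1: 'XX', 2: 'XX', 3: 'YY', 4: '', 5: '', 6: 'XX'}})
df_update = pd.DataFrame({'date': {0: '24/12/2021', 1: '20/12/2021', 2: '15/10/2021', 3: '15/10/2021', 4: '15/10/2021', 5: '15/10/2021', 6: '11/10/2021', 7: '07/10/2021', 8: '06/10/2021'}, 'notification': {0: 'string2', 1: 'string1', 2: 'string1', 3: 'string1', 4: 'string1', 5: 'string2', 6: 'string3', 7: 'string1', 8: 'string2'}, 'amount': {0: 20.0, 1: 12.0, 2: 30.0, 3: 30.0, 4: 30.0, 5: 25.0, 6: 3.0, 7: 10.0, 8: 10.0}})

key_cols = ["date", "notification", "amount"]

# Number repeated transactions within each frame, so that genuine
# duplicates get distinct keys and pair up one-to-one in the merge.
master = df_master.assign(occ=df_master.groupby(key_cols).cumcount())
update = df_update.assign(occ=df_update.groupby(key_cols).cumcount())

# Outer merge on the shared columns plus the duplicate counter:
# overlapping rows keep the master's added columns, update-only rows
# come through with NaN there, and master-only rows are preserved.
merged = update.merge(master, on=key_cols + ["occ"], how="outer").fillna("")

# Sort chronologically (newest first) and drop the helper column.
merged["date"] = pd.to_datetime(merged["date"], format="%d/%m/%Y")
merged = merged.sort_values("date", ascending=False).drop(columns="occ")
merged["date"] = merged["date"].dt.strftime("%d/%m/%Y")
df_result = merged.reset_index(drop=True)
print(df_result)
```

This leaves the added columns just as flexible as the loop version: any column of master outside key_cols simply rides along through the merge. The relative order of identical rows sharing a date may differ slightly from the loop version.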