
I have two dataframes that contain transaction values ordered by date. The first one (say the master) has added columns. The second one (the update) contains the same rows as the master, plus new rows.

I would like to update the master with the rows of the update, matching on the columns that aren't the added ones. I need to keep the duplicates within each dataframe (sometimes the exact same transaction occurs several times on the same day), but avoid duplicating the rows that overlap between the two dataframes. I would also like to preserve the transaction order.

Example:

MASTER
       date       notification       amount     added_info
0    15/10/2021     string1          30.00          XX
1    15/10/2021     string1          30.00          XX
2    15/10/2021     string2          25.00          XX
3    11/10/2021     string3           3.00          YY
4    07/10/2021     string1          10.00
5    05/10/2021     string2          12.50
6    30/09/2021     string2          12.50          XX

UPDATE
       date       notification       amount
0    24/12/2021     string2          20.00 
1    20/12/2021     string1          12.00 
2    15/10/2021     string1          30.00 
3    15/10/2021     string1          30.00
4    15/10/2021     string1          30.00    
5    15/10/2021     string2          25.00    
6    11/10/2021     string3           3.00   
7    07/10/2021     string1          10.00
8    06/10/2021     string2          10.00

EXPECTED RESULT (UPDATED MASTER)
       date       notification       amount     added_info
0    24/12/2021     string2          20.00 
1    20/12/2021     string1          12.00
2    15/10/2021     string1          30.00
3    15/10/2021     string1          30.00          XX
4    15/10/2021     string1          30.00          XX
5    15/10/2021     string2          25.00          XX
6    11/10/2021     string3           3.00          YY
7    07/10/2021     string1          10.00
8    06/10/2021     string2          10.00
9    05/10/2021     string2          12.50
10   30/09/2021     string2          12.50          XX

Explanation:

It added rows 0, 1, 2 and 8 of "update", inserting them at the right chronological position; it didn't use the info in "added_info" for the comparison; it didn't remove rows 0 and 1 of "master" (which are duplicates within master); and it did add row 2 of "update", which duplicates existing rows but is not part of the overlap (update has three copies, master only two).

Solutions using .concat() followed by .drop_duplicates() also remove the legitimate duplicates within each dataframe.

The naive solution I can think of is a for row in update loop: compare a concatenated string of the date, notification and amount values of each row in update against those in master; every time a duplicate is found, mark the matching rows in both update and master, break, and move on to the next row. Finally, .concat() the unmarked rows of update to master and sort by date.

It sounds very inefficient and complicated. I somehow expect pandas to have a built-in method to do this efficiently, but I haven't figured out the exact variation of .merge() or .join() to use.
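To illustrate the failure mode on a minimal, hypothetical pair of frames (the column names match the example above, but the data is reduced to a single repeated transaction):

```python
import pandas as pd

# Minimal reproduction: master holds two copies of a transaction,
# update holds three copies (one of them genuinely new).
df_m = pd.DataFrame({'date': ['15/10/2021'] * 2,
                     'notification': ['string1'] * 2,
                     'amount': [30.0] * 2})
df_u = pd.DataFrame({'date': ['15/10/2021'] * 3,
                     'notification': ['string1'] * 3,
                     'amount': [30.0] * 3})

# concat + drop_duplicates collapses all five rows into one,
# where the expected result would keep max(2, 3) = 3 copies
combined = pd.concat([df_m, df_u]).drop_duplicates()
print(len(combined))  # 1
```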

Here are constructors for the example dataframes:

import pandas as pd

df_master = pd.DataFrame({'date': {0: '15/10/2021', 1: '15/10/2021', 2: '15/10/2021', 3: '11/10/2021', 4: '07/10/2021', 5: '05/10/2021', 6: '30/09/2021'}, 'notification': {0: 'string1', 1: 'string1', 2: 'string2', 3: 'string3', 4: 'string1', 5: 'string2', 6: 'string2'}, 'amount': {0: 30.0, 1: 30.0, 2: 25.0, 3: 3.0, 4: 10.0, 5: 12.5, 6: 12.5}, 'added_info': {0: 'XX', 1: 'XX', 2: 'XX', 3: 'YY', 4: '', 5: '', 6: 'XX'}})
df_update = pd.DataFrame({'date': {0: '24/12/2021', 1: '20/12/2021', 2: '15/10/2021', 3: '15/10/2021', 4: '15/10/2021', 5: '15/10/2021', 6: '11/10/2021', 7: '07/10/2021', 8: '06/10/2021'}, 'notification': {0: 'string2', 1: 'string1', 2: 'string1', 3: 'string1', 4: 'string1', 5: 'string2', 6: 'string3', 7: 'string1', 8: 'string2'}, 'amount': {0: 20.0, 1: 12.0, 2: 30.0, 3: 30.0, 4: 30.0, 5: 25.0, 6: 3.0, 7: 10.0, 8: 10.0}})

Thanks a lot!

EDIT:

Here is the naive approach I was talking about. It has the advantage of keeping the "added_info" column expandable, as in there could be more such columns, or ones with other names (they remain untouched). The nice thing about this aspect is that it would let me grow the master dataframe in the future.

Let me know how bad this solution is. Thanks!

import pandas as pd

df_master = pd.DataFrame({'date': {0: '15/10/2021', 1: '15/10/2021', 2: '15/10/2021', 3: '11/10/2021', 4: '07/10/2021', 5: '05/10/2021', 6: '30/09/2021'}, 'notification': {0: 'string1', 1: 'string1', 2: 'string2', 3: 'string3', 4: 'string1', 5: 'string2', 6: 'string2'}, 'amount': {0: 30.0, 1: 30.0, 2: 25.0, 3: 3.0, 4: 10.0, 5: 12.5, 6: 12.5}, 'added_info': {0: 'XX', 1: 'XX', 2: 'XX', 3: 'YY', 4: '', 5: '', 6: 'XX'}})
df_update = pd.DataFrame({'date': {0: '24/12/2021', 1: '20/12/2021', 2: '15/10/2021', 3: '15/10/2021', 4: '15/10/2021', 5: '15/10/2021', 6: '11/10/2021', 7: '07/10/2021', 8: '06/10/2021'}, 'notification': {0: 'string2', 1: 'string1', 2: 'string1', 3: 'string1', 4: 'string1', 5: 'string2', 6: 'string3', 7: 'string1', 8: 'string2'}, 'amount': {0: 20.0, 1: 12.0, 2: 30.0, 3: 30.0, 4: 30.0, 5: 25.0, 6: 3.0, 7: 10.0, 8: 10.0}})

df_update["Dupl"] = ""

# build one comparison string per row from the key columns
df_master_concat = pd.DataFrame()
df_update_concat = pd.DataFrame()
df_update_concat["string"] = df_update[["date", "notification", "amount"]].agg(lambda x: "".join(x.astype(str)), axis=1)
df_master_concat["string"] = df_master[["date", "notification", "amount"]].agg(lambda x: "".join(x.astype(str)), axis=1)

# mark each row of update that still has a match left in master,
# consuming the matched master row so duplicates are paired one by one
for indx, row in df_update_concat.iterrows():
    if row["string"] in df_master_concat["string"].values:
        df_update.at[indx, "Dupl"] = "X"
        first_match = df_master_concat.index[df_master_concat["string"] == row["string"]][0]
        df_master_concat.drop(first_match, inplace=True)  # drop the first occurrence

# append the unmarked (i.e. genuinely new) rows of update to master
df_update = df_update[df_update.Dupl != "X"]
df_update.drop('Dupl', axis=1, inplace=True)
df_master = pd.concat([df_master, df_update], ignore_index=True).fillna("")


df_master["date"] = pd.to_datetime(df_master["date"], format="%d/%m/%Y")
df_master.sort_values(by='date', inplace = True, ascending=False)
df_master["date"] = df_master["date"].dt.strftime("%d/%m/%Y")
df_master = df_master.reset_index(drop=True)

print(df_master)
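For what it's worth, the marking loop above is really computing a multiset difference between the two collections of (date, notification, amount) tuples. A minimal sketch of just that idea with collections.Counter (same constructors as above; this only counts the rows that need appending, it doesn't rebuild the dataframe):

```python
import pandas as pd
from collections import Counter

df_master = pd.DataFrame({'date': {0: '15/10/2021', 1: '15/10/2021', 2: '15/10/2021', 3: '11/10/2021', 4: '07/10/2021', 5: '05/10/2021', 6: '30/09/2021'}, 'notification': {0: 'string1', 1: 'string1', 2: 'string2', 3: 'string3', 4: 'string1', 5: 'string2', 6: 'string2'}, 'amount': {0: 30.0, 1: 30.0, 2: 25.0, 3: 3.0, 4: 10.0, 5: 12.5, 6: 12.5}, 'added_info': {0: 'XX', 1: 'XX', 2: 'XX', 3: 'YY', 4: '', 5: '', 6: 'XX'}})
df_update = pd.DataFrame({'date': {0: '24/12/2021', 1: '20/12/2021', 2: '15/10/2021', 3: '15/10/2021', 4: '15/10/2021', 5: '15/10/2021', 6: '11/10/2021', 7: '07/10/2021', 8: '06/10/2021'}, 'notification': {0: 'string2', 1: 'string1', 2: 'string1', 3: 'string1', 4: 'string1', 5: 'string2', 6: 'string3', 7: 'string1', 8: 'string2'}, 'amount': {0: 20.0, 1: 12.0, 2: 30.0, 3: 30.0, 4: 30.0, 5: 25.0, 6: 3.0, 7: 10.0, 8: 10.0}})

keys = ["date", "notification", "amount"]

# count each (date, notification, amount) tuple in both frames
master_counts = Counter(df_master[keys].itertuples(index=False, name=None))
update_counts = Counter(df_update[keys].itertuples(index=False, name=None))

# multiset difference: copies that update has beyond what master already holds
extra = update_counts - master_counts
print(sum(extra.values()))  # 4
```

Those 4 rows are exactly the ones the expected result appends to master.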

1 Answer


The use of merge and join isn't straightforward here because you need to keep track of the number of rows for each ("date", "notification", "amount") tuple. You can do this with a groupby on each of the dataframes, adding new rows wherever update has more rows than master:

# group master, collecting the added_info values of duplicate rows into lists
df_master_gp = df_master.groupby(["date", "notification", "amount"]).agg(list)
# group update, counting how many rows each tuple has
df_update_gp = df_update.assign(nb_el=0).groupby(["date", "notification", "amount"]).agg(len)

df_out = df_master_gp.join(df_update_gp, how='outer')
df_out["added_info"] = df_out["added_info"].fillna("")
df_out["nb_el"] = df_out["nb_el"].fillna(0)
# pad each added_info list with None where update has more rows than master
df_out["added_info"] = df_out.apply(
                        lambda x: list(x['added_info']) + [None]*max(int(x["nb_el"])-len(x['added_info']), 0)
                        , axis=1)
print(df_out.drop(columns="nb_el").explode("added_info").reset_index())

Output:

          date notification  amount added_info
0   05/10/2021      string2    12.5       None
1   06/10/2021      string2    10.0       None
2   07/10/2021      string1    10.0       None
3   11/10/2021      string3     3.0         YY
4   15/10/2021      string1    30.0         XX
5   15/10/2021      string1    30.0         XX
6   15/10/2021      string1    30.0       None
7   15/10/2021      string2    25.0         XX
8   20/12/2021      string1    12.0       None
9   24/12/2021      string2    20.0       None
10  30/09/2021      string2    12.5         XX
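A related sketch (not part of the answer above) that keeps any number of extra master columns untouched: number the repeated occurrences of each key tuple with groupby().cumcount(), then do an outer merge on the key columns plus that occurrence counter:

```python
import pandas as pd

df_master = pd.DataFrame({'date': {0: '15/10/2021', 1: '15/10/2021', 2: '15/10/2021', 3: '11/10/2021', 4: '07/10/2021', 5: '05/10/2021', 6: '30/09/2021'}, 'notification': {0: 'string1', 1: 'string1', 2: 'string2', 3: 'string3', 4: 'string1', 5: 'string2', 6: 'string2'}, 'amount': {0: 30.0, 1: 30.0, 2: 25.0, 3: 3.0, 4: 10.0, 5: 12.5, 6: 12.5}, 'added_info': {0: 'XX', 1: 'XX', 2: 'XX', 3: 'YY', 4: '', 5: '', 6: 'XX'}})
df_update = pd.DataFrame({'date': {0: '24/12/2021', 1: '20/12/2021', 2: '15/10/2021', 3: '15/10/2021', 4: '15/10/2021', 5: '15/10/2021', 6: '11/10/2021', 7: '07/10/2021', 8: '06/10/2021'}, 'notification': {0: 'string2', 1: 'string1', 2: 'string1', 3: 'string1', 4: 'string1', 5: 'string2', 6: 'string3', 7: 'string1', 8: 'string2'}, 'amount': {0: 20.0, 1: 12.0, 2: 30.0, 3: 30.0, 4: 30.0, 5: 25.0, 6: 3.0, 7: 10.0, 8: 10.0}})

keys = ["date", "notification", "amount"]

# number the duplicate occurrences of each key tuple within each frame:
# the first copy gets 0, the second 1, and so on
m = df_master.assign(occ=df_master.groupby(keys).cumcount())
u = df_update.assign(occ=df_update.groupby(keys).cumcount())

# an outer merge on keys + occurrence keeps max(count_master, count_update)
# copies per tuple and carries over every extra master column automatically
merged = m.merge(u, on=keys + ["occ"], how="outer").drop(columns="occ")
merged["added_info"] = merged["added_info"].fillna("")

# restore reverse-chronological order
merged["date"] = pd.to_datetime(merged["date"], format="%d/%m/%Y")
merged = merged.sort_values("date", ascending=False).reset_index(drop=True)
merged["date"] = merged["date"].dt.strftime("%d/%m/%Y")
print(merged)  # 11 rows, matching the expected result
```

Because the merge only names the key columns, any number of "added_info"-style columns come along for free, which also covers the follow-up question in the comments below.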
Tranbi
  • Thanks a lot, I have been trying your code and I still seem to get duplicates: for instance, unlike you, for the 15/10/2021 I get the rows of the 30.0 twice; two with None and two with the added value. Running the example I gave I get an output of 15 rows, not 10. Any idea why? – ChienMouille Feb 05 '23 at 09:58
  • 1
    Do you use the same input as in the question? It's easier to reproduce the problem if you provide a constructor for your df. With `to_dict()` for example. Also see [this question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Tranbi Feb 05 '23 at 10:49
  • yes, sorry, my bad, I had spaces in one of the dataframes values – ChienMouille Feb 05 '23 at 10:53
  • ok, one more question: is there a way to do the same where "added_info" would be a variable number of columns, with potentially different values? (I would like to keep the possibility to grow the master dataframe without having to modify the update function.) – ChienMouille Feb 05 '23 at 10:58
  • please see my edit in the question to see what I mean – ChienMouille Feb 05 '23 at 12:58