I have been searching for a long time for an answer but could not find one. I have two dataframes, target and backup, which both have the same columns. I want to look at a given column and add to target all the rows from backup whose value in that column is not in target. The most straightforward solution is:
import pandas as pd
import numpy as np
target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})
merged = target.copy()
for item in backup.key1.unique():
    if item not in target.key1.unique():
        merged = pd.concat([merged, backup.loc[backup.key1 == item]])
merged.reset_index(drop=True, inplace=True)
giving
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
4 K4 A4 B4
I have tried several approaches using only pandas, but none of them work.
# Does not work: concat creates duplicate rows, and drop_duplicates() only removes rows that are identical in every column, so rows with the same key but different values remain -- compare the K2 rows (A2 vs. A) and the K5 rows (NaN vs. A5)
pd.concat([target, backup]).drop_duplicates()
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
1 K2 A B2
3 K4 A4 B4
4 K5 A5 B5
# Does not work because the values from backup overwrite the data in target -- see K2 (A2 becomes A) and K5 (NaN becomes A5)
pd.merge(target, backup, how="right")
key1 A B
0 K1 A1 B1
1 K2 A B2
2 K3 A3 B3
3 K4 A4 B4
4 K5 A5 B5
- Importantly, it is not a duplicate of this post, since I do not want a new column and, more importantly, the values are not NaN in target; they are simply not there. Furthermore, if I then used what is proposed there for merging the columns, the NaN in target would be replaced by the value in backup, which is unwanted.
- It is not a duplicate of this post, which uses pandas' combine_first, because in that case the NaN is filled with the value from backup, which is wrong:
target.combine_first(backup)
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 A4 B5
4 K5 A5 B5
- Lastly,
target.join(backup, on=["key1"])
gives me an annoying
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
which I really do not understand, since both key columns contain only strings, and the proposed solution does not work.
So I would like to ask: what am I missing? How can I do this using pandas methods? Thanks a lot.
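One possible approach, offered as a minimal sketch rather than a definitive answer: select the backup rows whose key1 does not appear in target using Series.isin, then append only those rows with pd.concat. The variable name missing is introduced here purely for illustration, and the sketch assumes key1 is the only column that decides whether a row is already present.

```python
import numpy as np
import pandas as pd

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

# Boolean mask (hypothetical name): True for backup rows whose key1 is absent from target.
missing = ~backup["key1"].isin(target["key1"])

# Append only those rows; existing target rows, including the NaN in K5, are left untouched.
merged = pd.concat([target, backup[missing]], ignore_index=True)
print(merged)
```

Because the rows already in target are never compared against backup on any column other than key1, the differing K2 row and the NaN in the K5 row should survive unchanged, and ignore_index=True replaces the separate reset_index call from the loop version.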