
I have been searching for a long time but could not find an answer. I have two dataframes, target and backup, which have the same columns. Given a key column, I want to append to target all rows from backup whose key is not already present in target. The most straightforward solution is:

import pandas as pd
import numpy as np

target = pd.DataFrame({
         "key1": ["K1", "K2", "K3", "K5"],
         "A": ["A1", "A2", "A3", np.nan],
         "B": ["B1", "B2", "B3", "B5"],
     })

backup = pd.DataFrame({
         "key1": ["K1", "K2", "K3", "K4", "K5"],
         "A": ["A1", "A", "A3", "A4", "A5"],
         "B": ["B1", "B2", "B3", "B4", "B5"],
     })

merged = target.copy()

for item in backup.key1.unique():
    if item not in target.key1.unique():
        merged = pd.concat([merged, backup.loc[backup.key1 == item]])

merged.reset_index(drop=True, inplace=True)

giving

  key1    A   B
0   K1   A1  B1
1   K2   A2  B2
2   K3   A3  B3
3   K5  NaN  B5
4   K4   A4  B4

Now I have tried several approaches using plain pandas, none of which works.

  1. pandas concat
# Does not work because it creates duplicate rows, and if those are dropped, the updated rows that differ between the frames are not dropped -- compare the K2 lines (A vs. A2) and the K5 lines (A5 vs. NaN)

pd.concat([target, backup]).drop_duplicates()

  key1  A   B
0   K1  A1  B1
1   K2  A2  B2
2   K3  A3  B3
3   K5  NaN B5
1   K2  A   B2
3   K4  A4  B4
4   K5  A5  B5
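To make the failure concrete: drop_duplicates with no subset compares entire rows, so the K2 and K5 rows, which differ in column A, both survive. A minimal reproduction:

```python
import pandas as pd
import numpy as np

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

# only rows identical in every column (K1, K3) are dropped;
# K2 and K5 differ in column A, so both versions remain
dedup = pd.concat([target, backup]).drop_duplicates()
print(dedup.key1.tolist())
# -> ['K1', 'K2', 'K3', 'K5', 'K2', 'K4', 'K5']
```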
  2. pandas merge
# Does not work because the backup would overwrite data in the target -- NaN

pd.merge(target, backup, how="right")

  key1  A   B
0   K1  A1  B1
1   K2  A   B2
2   K3  A3  B3
3   K4  A4  B4
4   K5  A5  B5
  3. Importantly, it is not a duplicate of this post, since I do not want a new column and, more importantly, the values are not NaN in target -- they are simply not there. Furthermore, if I then used what is proposed there for merging the columns, the NaN in target would be replaced by the value from backup, which is unwanted.

  4. It is not a duplicate of this post, which uses pandas combine_first, because in that case the NaN is filled with the value from backup, which is wrong:

target.combine_first(backup)

   key1 A   B
0   K1  A1  B1
1   K2  A2  B2
2   K3  A3  B3
3   K5  A4  B5
4   K5  A5  B5
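A quick check confirms this (and, as far as I can tell, combine_first aligns on the index, not on key1, which is why the A4 from the K4 row ends up in the K5 row above):

```python
import pandas as pd
import numpy as np

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

# combine_first patches NaN cells using the same (index, column)
# position in backup: target's NaN at index 3 ("K5") is filled
# with backup's value at index 3, which belongs to "K4"
combined = target.combine_first(backup)
print(combined.loc[3, ["key1", "A"]].tolist())
# -> ['K5', 'A4']
```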
  5. Lastly,
target.join(backup, on=["key1"])

gives me an annoying

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

which I really do not get, since both columns are pure strings, and the workaround proposed in the message does not solve my problem.
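For what it is worth, the error seems to come from join aligning target["key1"] against backup's index, which is a default int64 RangeIndex, hence "object and int64". A sketch that sets the index first does run without the error (though it only joins columns side by side; it still does not add the missing K4 row):

```python
import pandas as pd
import numpy as np

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

# join matches target["key1"] against the right frame's index,
# so backup must be indexed by key1 first; rsuffix disambiguates
# the overlapping A and B columns
joined = target.join(backup.set_index("key1"), on="key1", rsuffix="_backup")
print(joined.columns.tolist())
# -> ['key1', 'A', 'B', 'A_backup', 'B_backup']
```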

So I would like to ask, what am I missing? How can I do it using some pandas methods? Thanks a lot.

My Work

2 Answers


Use concat with backup filtered down to the rows whose key1 does not exist in target.key1, via Series.isin in boolean indexing:

merged = pd.concat([target, backup[~backup.key1.isin(target.key1)]])
print(merged)
  key1    A   B
0   K1   A1  B1
1   K2   A2  B2
2   K3   A3  B3
3   K5  NaN  B5
3   K4   A4  B4
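If a clean RangeIndex is wanted (the K4 row keeps its original label 3 above), ignore_index=True can be passed to concat instead of calling reset_index afterwards. A self-contained check:

```python
import pandas as pd
import numpy as np

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

# keep every target row unchanged, append only backup rows
# whose key1 is not already present, and renumber the index
merged = pd.concat(
    [target, backup[~backup.key1.isin(target.key1)]],
    ignore_index=True,
)
print(merged.key1.tolist())
# -> ['K1', 'K2', 'K3', 'K5', 'K4']
```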
jezrael

Maybe you can try the subset parameter of df.drop_duplicates():

pd.concat([target, backup]).drop_duplicates(subset="key1")

which gives output:

  key1    A   B
0   K1   A1  B1
1   K2   A2  B2
2   K3   A3  B3
3   K5  NaN  B5
3   K4   A4  B4
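As a sanity check that the target version of each row wins here: drop_duplicates keeps the first occurrence by default (keep="first"), and target rows come first in the concat, so even the NaN in the K5 row survives:

```python
import pandas as pd
import numpy as np

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

# deduplicate on key1 only; keep="first" retains the target
# row whenever the same key appears in both frames
merged = pd.concat([target, backup]).drop_duplicates(subset="key1")
print(merged.key1.tolist())
# -> ['K1', 'K2', 'K3', 'K5', 'K4']
```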
Michał89
  • Hi, thanks for the answer. For some reason, this does not work on my dataset; it shows the complement (or something of that sort). I have not yet figured out why, or why my MRE does not cover that, but I wanted to state it in case someone else runs into it. – My Work Jan 27 '21 at 14:34