Match value in column between 2 datasets, update multiple columns

Question

I have two dataframes:

df1
     A   B     id col1 col2 col3 
0   E1  E2     0   NaN  NaN  NaN
1   E1  E3     1   NaN  NaN  NaN
2   E1  E4     2   NaN  NaN  NaN 
3   E2  E1     3   NaN  NaN  NaN
4   E1  E4     4   NaN  NaN  NaN
5   E2  E1     5   NaN  NaN  NaN

df2
     A   B     id col1 col2 col3 
0   E1  E2     3     1    0    1
1   E1  E3     5     0    1    1

I want to update the values in col1, col2, col3 in df1 by taking those values in df2 by matching on id to get:

df3
     A   B     id col1 col2 col3 
0   E1  E2     0   NaN  NaN  NaN
1   E1  E3     1   NaN  NaN  NaN
2   E1  E4     2   NaN  NaN  NaN 
3   E2  E1     3     1    0    1
4   E1  E4     4   NaN  NaN  NaN
5   E2  E1     5     0    1    1

As my actual dataframe is much larger, I want to use the list of the column names that I would like to update:

add = ['col1', 'col2', 'col3']

How can I use this column names to get the desired result?

I referred to this question and this question which directed me to use .loc but I can't figure out how to incorporate a reference to the index and the list of multiple columns a la:

df1.loc[df1['edge_id'] == df2['edge_id'], add] = df2[add]

Obviously this didn't work...

cs95 · Answer 1 · 2017-12-15T11:42:35.143

2

Here's one solution with combine_first -

i = df1.set_index('id')
j = df2.drop(['A', 'B'], 1).set_index('id') # df2.set_index('id')[add]

df = i.combine_first(j).reset_index()
df

   id   A   B  col1  col2  col3
0   0  E1  E2   NaN   NaN   NaN
1   1  E1  E3   NaN   NaN   NaN
2   2  E1  E4   NaN   NaN   NaN
3   3  E2  E1   1.0   0.0   1.0
4   4  E1  E4   NaN   NaN   NaN
5   5  E2  E1   0.0   1.0   1.0

If the order of columns matters, reindex -

df = i.combine_first(j).reset_index().reindex(columns=df1.columns)
df

    A   B  id  col1  col2  col3
0  E1  E2   0   NaN   NaN   NaN
1  E1  E3   1   NaN   NaN   NaN
2  E1  E4   2   NaN   NaN   NaN
3  E2  E1   3   1.0   0.0   1.0
4  E1  E4   4   NaN   NaN   NaN
5  E2  E1   5   0.0   1.0   1.0

edited Dec 15 '17 at 11:42

answered Dec 15 '17 at 11:35

cs95

379,657
97
704
746

Let me know about views on concat option – Bharath M Shetty Dec 15 '17 at 12:21
Thanks for solution Coldspeed. I ended using Jez's, only because I could not guarantee knowing in advance which columns to drop. Though I think his is like your second one too. They all work well. Really appreciate your answering :) – Chuck Dec 15 '17 at 14:57

score 2 · Accepted Answer · answered Dec 15 '17 at 11:36

Use combine_first:

df1[add] = df1.set_index('id')[add].combine_first(df2.set_index('id')[add]).values
print (df1)
    A   B  id  col1  col2  col3
0  E1  E2   0   NaN   NaN   NaN
1  E1  E3   1   NaN   NaN   NaN
2  E1  E4   2   NaN   NaN   NaN
3  E2  E1   3   1.0   0.0   1.0
4  E1  E4   4   NaN   NaN   NaN
5  E2  E1   5   0.0   1.0   1.0

score 2 · Answer 3 · answered Dec 15 '17 at 12:19

2

We can use concat and drop duplicates i.e

df3 = pd.concat([df,df2]).drop_duplicates(['id'],keep='last').reset_index(drop=True)

 A   B  id  col1  col2  col3
0  E1  E2   0   NaN   NaN   NaN
1  E1  E3   1   NaN   NaN   NaN
2  E1  E4   2   NaN   NaN   NaN
3  E1  E4   4   NaN   NaN   NaN
4  E1  E2   3   1.0   0.0   1.0
5  E1  E3   5   0.0   1.0   1.0

answered Dec 15 '17 at 12:19

Bharath M Shetty

30,075
6
57
108

1

Not a bad option, though I'm a bit worried about how this will generalise to OP's actual data. – cs95 Dec 15 '17 at 12:25
Thanks for solution Dark :) – Chuck Dec 15 '17 at 14:57

Match value in column between 2 datasets, update multiple columns

3 Answers3