Method 1: combine_first
index_cols = ['test', 'Cond']
(
df1
.set_index(index_cols)
.combine_first(
df2.set_index(index_cols)
).reset_index()
)
Explanation:
set_index
moves the specified columns to the index, indicating that each row should be identified by its test
and Cond
columns.
foo.combine_first(bar)
will identify matching index + column labels between foo
and bar
, and fill in values from bar
wherever foo
is NaN or has a column/row missing. In this case, thanks to the set_index
, the two dataframes will have their rows matched where test
and Cond
are the same, and then the UL
and LL
values from df2
will be filled in to the corresponding columns of the output.
reset_index
simply reverses the set_index
call, so that test
and Cond
become regular columns again.
Note that this operation might mangle the order of your columns, so if that is important to you then you can call .reindex(df1.columns, axis=1)
at the very end, which will reorder the columns to original order in df1
.
Method 2: merge
Alternatively you can use the merge
method, which allows you to operate on the columns directly without using set_index
, but will require some other preprocessing:
index_cols = ['test', 'Cond']
(
df1
.drop(['LL', 'UL'], axis=1)
.merge(
df2,
on=index_cols
)
)
The .drop
call is necessary because otherwise merge
will include the UL
and LL
columns from both DataFrames in the output:
test Cond day mode LL_x UL_x LL_y UL_y
0 a T1 Tue 7 NaN NaN 15.0 23.0
1 b T2 mon 7 NaN NaN -3.0 -3.5
2 c T2 sun 6 NaN NaN -19.0 -11.0
3 c T2 sat 6 NaN NaN -19.0 -11.0
4 d T3 fri 3 NaN NaN 6.5 14.5
5 d T3 wed 3 NaN NaN 6.5 14.5
Which to use?
With the data that you have provided, merge
seems like the more natural operation - if you never expect UL and LL to have any data in df1
, then if possible I'd recommend simply removing those column headers entirely from the input CSV, so that df1
doesn't have those columns at all. In that case, the drop
call would no longer be necessary and the required merge
call is very expressive.
However, if you expect that df1
would sometimes have real values for UL
or LL
, and you want to include those values in the output, then the combine_first
solution is what you want. Note that if both df1
and df2
have different non-null values for a particular row/column, then the df1.combine_first(df2)
will select the value from df1
and ignore the df2
value. If you instead wanted to prioritise the values from df2
then you want to call it the other way round, i.e. df2.combine_first(df1)
.