79

I want to perform a join/merge/append operation on a dataframe with datetime index.

Let's say I have df1 and I want to add df2 to it. df2 can have fewer or more columns, and overlapping indexes. For all rows where the indexes match, if df2 has the same column as df1, I want the values of df1 be overwritten with those from df2.

How can I obtain the desired result?

Mike Graham
  • 73,987
  • 14
  • 101
  • 130
saroele
  • 9,481
  • 10
  • 29
  • 39

4 Answers4

72

How about: df2.combine_first(df1)?

In [33]: df2
Out[33]: 
                   A         B         C         D
2000-01-03  0.638998  1.277361  0.193649  0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05  0.435507 -0.025162 -1.112890  0.324111
2000-01-06 -0.210756 -1.027164  0.036664  0.884715
2000-01-07 -0.821631 -0.700394 -0.706505  1.193341
2000-01-10  1.015447 -0.909930  0.027548  0.258471
2000-01-11 -0.497239 -0.979071 -0.461560  0.447598

In [34]: df1
Out[34]: 
                   A         B         C
2000-01-03  2.288863  0.188175 -0.040928
2000-01-04  0.159107 -0.666861 -0.551628
2000-01-05 -0.356838 -0.231036 -1.211446
2000-01-06 -0.866475  1.113018 -0.001483
2000-01-07  0.303269  0.021034  0.471715
2000-01-10  1.149815  0.686696 -1.230991
2000-01-11 -1.296118 -0.172950 -0.603887
2000-01-12 -1.034574 -0.523238  0.626968
2000-01-13 -0.193280  1.857499 -0.046383
2000-01-14 -1.043492 -0.820525  0.868685

In [35]: df2.comb
df2.combine        df2.combineAdd     df2.combine_first  df2.combineMult    

In [35]: df2.combine_first(df1)
Out[35]: 
                   A         B         C         D
2000-01-03  0.638998  1.277361  0.193649  0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05  0.435507 -0.025162 -1.112890  0.324111
2000-01-06 -0.210756 -1.027164  0.036664  0.884715
2000-01-07 -0.821631 -0.700394 -0.706505  1.193341
2000-01-10  1.015447 -0.909930  0.027548  0.258471
2000-01-11 -0.497239 -0.979071 -0.461560  0.447598
2000-01-12 -1.034574 -0.523238  0.626968       NaN
2000-01-13 -0.193280  1.857499 -0.046383       NaN
2000-01-14 -1.043492 -0.820525  0.868685       NaN

Note that it takes the values from df1 for indices that do not overlap with df2. If this doesn't do exactly what you want I would be willing to improve this function / add options to it.

Scott Martin
  • 1,260
  • 2
  • 17
  • 27
Wes McKinney
  • 101,437
  • 32
  • 142
  • 108
54

For a merge like this, the update method of a DataFrame is useful.

Taking the examples from the documentation:

import pandas as pd
import numpy as np

df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, 2.1, np.nan],
                   [np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]],
                   index=[1, 2])

Data before the update:

>>> df1
     0    1    2
0  NaN  3.0  5.0
1 -4.6  2.1  NaN
2  NaN  7.0  NaN
>>>
>>> df2
      0    1    2
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0

Let's update df1 with data from df2:

df1.update(df2)

Data after the update:

>>> df1
      0    1    2
0   NaN  3.0  5.0
1 -42.6  2.1 -8.2
2  -5.0  1.6  4.0

Remarks:

  • It's important to notice that this is an operation "in place", modifying the DataFrame that calls update.
  • Also note that non NaN values in df1 are not overwritten with NaN values in df2
Ben
  • 20,038
  • 30
  • 112
  • 189
Nicolás Ozimica
  • 9,481
  • 5
  • 38
  • 51
  • 8
    This is more intuitive than `combine_first` because it acts exactly as the `update` method we know from dicts. – saroele Apr 03 '17 at 12:06
  • 3
    Note, that DataFrame.update() is a bit different than dict because it doesn't add new index items the same way dicts add new keys. Example: https://gist.github.com/flutefreak7/b4eb6a93565c375d79b791c2bbd672b1 – flutefreak7 Oct 30 '20 at 19:23
  • 4
    you should use `update` only if you are happy to exclude indexes unique to df2 Use `combine_first` if you want to include them – Colin Anthony Feb 04 '22 at 07:07
0

I did not find a satisfactory answer for this task yet where one does not need to pre-specify which columns should be used. I am doing it by combining merge, rename, and drop. df_some_wrong is a dataframe with important new columns. However, it has some old columns that have wrong values and should be overwritten with the correct values contained in df_correct. Only the columns with wrong values in df_some_wrong are shared between df_some_wrong and df_correct.

wrong_suffix = "_wrong"
correct_suffix = "_correct"
# merge, rename shared columns with suffixes
merged_df = df_some_wrong.merge(
    df_correct.
    left_index=True,
    right_index=True,
    suffixes=(wrong_suffix, correct_suffix),
)
# find all the renamed columns from df_correct
rename_and_drop_cols = [
    col for col in merged_df.columns if col.endswith(correct_suffix)
]
# dict that describes how correct columns should be renamed
# to old column name
rename_cols = {
    key: key.replace(correct_suffix, "") for key in rename_and_drop_cols
}
# list describing the wrong columns that can be dropped
drop = [
    col.replace(correct_suffix, wrong_suffix) for col in rename_and_drop_cols
]
# finally rename and drop
merged_df = merged_df.rename(rename, axis=1).drop(drop, axis=1)
0

I've written a function, which should handle all cases (different lengths, nan values) and could be used inside an appending loop. However, it will prefer first df if data is conflicting.

def real_merge(df1,df2,on):
    if len(df1)==0:
        return df2
    elif len(df2)==0:
        return df1
    else:
        df3=pd.merge(df1,df2,on=on,how='outer')
        #now we have a mess to fix
        cols=[x[:-2] for x in df3.columns if x.endswith('_x')]
        for i_col in cols:
            df3.loc[:,i_col+'_x']=df3[i_col+'_x'].combine_first(df3[i_col+'_y'])
            df3.rename(columns={i_col+'_x':i_col},inplace=True)
            df3.drop(columns=[i_col+'_y'],inplace=True)
        return df3

Why is there no native pandas function for that? merge/update/cobmine_first are not sufficient for such a task which is requested so often on SE.