
I am trying to update a dataframe df_1 with values contained in a dataframe df_2.

df_1      
ID    B    val    val2       
0     1     2      2 
1     1     2      3
1     2     1      1
1     3     1      1
1     4     1      1

df_2      
ID    B    val    val2       
0     1     1      3 
1     1     3      3
1     3     3      3

The result I want is reported below in df_3:

df_3      
ID    B    val    val2       
0     1     1      3 
1     1     3      3
1     2     1      1
1     3     3      3
1     4     1      1

As you can see, the join columns are ID and B, and when there is a match in df_2 I substitute the values. Regarding the values, df_2 contains exactly the same columns as df_1.

There are two questions that mention this problem: Python pandas join on with overwrite and Python pandas - particular merge/replacement. They describe exactly the same problem I have, but when I try the solutions reported there, the join takes too much time. I have a huge dataset, and I suspect those solutions are only suitable for small chunks of data. Furthermore, those questions are old, so I was wondering whether there is anything new in pandas that can speed up this process.
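
For reference, the sample frames above can be rebuilt with:

import pandas as pd

# sample data copied from the tables above
df_1 = pd.DataFrame({
    'ID':   [0, 1, 1, 1, 1],
    'B':    [1, 1, 2, 3, 4],
    'val':  [2, 2, 1, 1, 1],
    'val2': [2, 3, 1, 1, 1],
})
df_2 = pd.DataFrame({
    'ID':   [0, 1, 1],
    'B':    [1, 1, 3],
    'val':  [1, 3, 3],
    'val2': [3, 3, 3],
})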

Guido Muscioni

1 Answer


You need cumcount to create an additional key, since ID is not unique; then concat the two frames and drop_duplicates, keeping the rows from df2:

# number the rows within each ID group so that (ID, keys) is unique
df1['keys'] = df1.groupby('ID').cumcount()
df2['keys'] = df2.groupby('ID').cumcount()
# stack both frames and keep the df2 row wherever (ID, keys) collides
pd.concat([df1, df2]).drop_duplicates(['ID', 'keys'], keep='last').drop('keys', axis=1).sort_index()
Out[31]: 
   ID  B  val  val2
0   0  1    1     3
1   1  1    3     3
2   1  2    1     1
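
Note that the cumcount key assumes the matching rows appear in the same order within each ID group in both frames; if that ordering is not guaranteed, the update-based approach below, which keys on ID and B directly, is the safer choice.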

Update

# align both frames on the real key (ID, B) and let update() overwrite the matching rows in place
df1.set_index(['ID', 'B'], inplace=True)
df1.update(df2.set_index(['ID', 'B']))
df1 = df1.reset_index()
df1
Out[258]: 
   ID  B  val  val2
0   0  1  1.0   3.0
1   1  1  3.0   3.0
2   1  2  1.0   1.0
3   1  3  3.0   3.0
4   1  4  1.0   1.0
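
Note that update() leaves val and val2 as float (the 1.0/3.0 in the output above); if integer columns are required, a cast back along these lines should work, assuming the columns contain no NaN:

# restore integer dtypes after update() upcast the columns to float
df1[['val', 'val2']] = df1[['val', 'val2']].astype(int)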
BENY