2

I have two dataframes df1 and df2.

np.random.seed(0)
df1= pd.DataFrame({'key': ['A', 'B', 'C', 'D'],'id': ['2', '23', '234', '2345'], '2021': np.random.randn(4)})
df2= pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'id': ['23', '2345', '67', '45'],'2022': np.random.randn(4)})

  key    id      2021
0   A     2  1.764052
1   B    23  0.400157
2   C   234  0.978738
3   D  2345  2.240893

  key    id      2022
0   B    23  1.867558
1   D  2345 -0.977278
2   E    67  0.950088
3   F    45 -0.151357

I want to have unique keys. If key found already just update the key else insert new row. I am not sure if I have to use merge/concat/join. Can anyone give insight on this please?

Note:I have used full outer join, it returns duplicate columns. Have edited the input dataframes after posting the question.

Thanks!

Poongodi
  • 67
  • 1
  • 8

3 Answers3

4

You can do it using merge function:

df = df1.merge(df2, on='key', how='outer')

df
   key     2021    2022
0   A   1.764052    NaN
1   B   0.400157    1.867558
2   C   0.978738    NaN
3   D   2.240893    -0.977278
4   E   NaN         0.950088
5   F   NaN        -0.151357

EDIT

In case you need to merge also for 'id':

df = df1.merge(df2, on=['key','id'], how='outer')

key id  2021    2022
A   2   1.764052    NaN
B   23  0.400157    1.867558
C   234 0.978738    NaN
D   2345    2.240893    -0.977278
E   67  NaN 0.950088
F   45  NaN -0.151357
Andrea Ierardi
  • 420
  • 3
  • 10
  • 1
    Have they changed somehow the functioning of the merge function ? The solution that you propose returns id_x and id_y columns instead of only id (at least for pandas 1.4.2). – Christopher H Feb 10 '23 at 13:23
  • @ChristopherH thank you. You're right! You can solve it by adding 'id' in the on paraemter: `df = df1.merge(df2,on=['key', 'id'], how='outer')` – Andrea Ierardi Feb 13 '23 at 17:15
3

I think you need create index from key and then join in concat:

df = pd.concat([df1.set_index('key'), df2.set_index('key')], axis=1).reset_index()
print (df)
  key      2021      2022
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

Given your description, it looks like you want combine_first. It will merge the two datasets by replacing the duplicates in order.

df2.set_index('key'). combine_first(df1.set_index('key')).reset_index()

Output:

  key      2021      2022
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
mozway
  • 194,879
  • 13
  • 39
  • 75