1

I have a dataframe (df01) organized as :

Index  |    colA   |   colB   |
-------------------------------
  ---  |   aaa_01  |  bbb_01  |
  ---  |   aaa_02  |  bbb_02  |
  ---  |   aaa_01  |  bbb_01  |  
  ---  |   aaa_01  |  bbb_01  |  
  ---  |   aaa_02  |  bbb_02  |
  ---  |   aaa_03  |  bbb_03  |
  ---  |   aaa_04  |  bbb_04  |  
  ---  |   aaa_02  |  bbb_02  |
  ---  |   aaa_02  |  bbb_02  |
  ---  |   aaa_01  |  bbb_01  |  

and another dataframe (df02), with two columns corresponding to colA and colB from the first database, and an additional column

Index  |    colA   |   colB   |  colC
---------------------------------------
  ---  |   aaa_01  |  bbb_01  |  ccc_01
  ---  |   aaa_02  |  bbb_02  |  ccc_02
  ---  |   aaa_03  |  bbb_03  |  ccc_03
  ---  |   aaa_03  |  bbb_03  |  ccc_03

For each row of the first database, I have to find the matching row in the second database (based on the values of colA and colB), and add the value of colC in the first database...basically, I want to end up with:

Index  |    colA   |   colB   |  colC
---------------------------------------
  ---  |   aaa_01  |  bbb_01  |  ccc_01
  ---  |   aaa_02  |  bbb_02  |  ccc_02
  ---  |   aaa_01  |  bbb_01  |  ccc_01
  ---  |   aaa_01  |  bbb_01  |  ccc_01  
  ---  |   aaa_02  |  bbb_02  |  ccc_02
  ---  |   aaa_03  |  bbb_03  |  ccc_03
  ---  |   aaa_04  |  bbb_04  |  ccc_04
  ---  |   aaa_02  |  bbb_02  |  ccc_02
  ---  |   aaa_02  |  bbb_02  |  ccc_02
  ---  |   aaa_01  |  bbb_01  |  ccc_01  

So far, I tried something like

C_list = []
for row in range(df_01.shape[0]):
    a = df_01.iloc[row, 1]
    b = df_01.iloc[row, 2]
    c = df_02[(df_02['colA']==a) & (df_02['colB']==b)]
    C_list.append(c)

df_01['colC'] = np.array(C_list)

but I got some errors, plus it's really not very pythonic...I'm sure there's an easier way to do it?

Carlo
  • 1,321
  • 12
  • 37
  • aaa_03, bbb_03, aaa_04, bbb_04, do not exist in your 2nd DF, why do you have them in your result as coming from colC? They are not present in your 2nd DF, so the result from colC should be blank, and not ccc_03, ccc_04 – sophocles Jan 27 '21 at 15:53
  • Sorry, I edited the question, they do exist in the second database – Carlo Jan 27 '21 at 15:55

1 Answers1

2

I think you are looking for a left_join:

res = pd.merge(df,df2,how='left',on = ['colA','colB'])

     colA    colB    colC
0  aaa_01  bbb_01  ccc_01
1  aaa_02  bbb_02  ccc_02
2  aaa_01  bbb_01  ccc_01
3  aaa_01  bbb_01  ccc_01
4  aaa_02  bbb_02  ccc_02
5  aaa_03  bbb_03  ccc_03
6  aaa_04  bbb_04  ccc_04
7  aaa_02  bbb_02  ccc_02
8  aaa_02  bbb_02  ccc_02
9  aaa_01  bbb_01  ccc_01
sophocles
  • 13,593
  • 3
  • 14
  • 33