0

I'm trying to merge 2 DataFrames following Pandas Merging 101 .

To train, I tried on 2 smalls Dataframe:

df1
    A   B    C
0   4   64   51
1   7   1    6
2   47  21   56
3   96  16   4
4   2   6    4
df2
    A    B    C    D
0   8    6    8    7
1   7    1    14   5
2   8    8    8    8
3   96   16   32   4

I want to check all values of columns A and B of df1, if they correspond to columns A and B of df2, then I create a new column D in dataFrame df2 that correspond to the value of D. If the values does not correspond to any rows in df2 then I store Nan.

I used this:

df3= df1.merge(df2[['A', 'B','D']], on= ['A', 'B'], how= 'left')

And this give me the expected output

OUTPUT:

df3

   A   B   C   D
0  4   64  51  NaN
1  7   1   6   5.0
2  47  21  56  NaN
3  96  16  4   4.0
4  2   6   4   NaN

Now I tried the same procedure but on real world datasets:

df1.shape  -> (1000, 92)
df2.shape  -> (1744, 22)
df3= df1.merge(df2[['NUM_WB', 'WB','TENSION', 'COURANT']], on= ['NUM_WB', 'WB','TENSION'], how= 'left')
df3.shape  -> (72162, 93)

I expected df3 to have the same number of rows as df1. How to explain this?

QQ1821
  • 143
  • 9
  • 1
    This means you have duplicated rows on the subset of merging columns, you can drop the duplicates in `df2` on this subset or accept to have more combinations. – mozway Oct 13 '22 at 07:14

0 Answers0