Why merging 2 data frames gives me one with triple the rows

Question

I have df1:

       x            y no.
0  -17.7    -0.785430  y1
1  -15.0 -3820.085000  y4
2  -12.5     2.138833  y3
..  ....     ........  ..
40  15.6     5.486901  y2
41  19.2     1.980686  y3
42  19.6     9.364718  y2

and df2:

       delta y     x
0     0.053884 -17.7
1     0.085000 -15.0
2     0.143237 -12.5
..    ........  ....
40    0.113099  15.6
41    0.102245  19.2
42    0.235282  19.6

They both have 43 rows, and x column is exactly the same on both.

Somehow when I merge them on x I get a df with 123 rows:

 x            y no.   delta y
0   -17.7    -0.785430  y1  0.053884
1   -15.0 -3820.085000  y4  0.085000
2   -12.5     2.138833  y3  0.143237
3   -12.4     1.721205  y3  0.251180
4   -12.1     2.227343  y2  0.127343
..    ...          ...  ..       ...
118  12.1     1.642526  y3  0.143886
119  14.4  2576.435000  y4  0.171000
120  15.6     5.486901  y2  0.113099
121  19.2     1.980686  y3  0.102245
122  19.6     9.364718  y2  0.235282

My input: final = df1.merge(df2, on="x") x float64 y float64 no. object dtype: object

delta y float64 x float64 dtype: object

x float64 y float64 no. object dtype: object

delta y float64 x float64 dtype: object

x float64 y float64 no. object dtype: object

delta y float64 x float64 dtype: object

df1 = pd.DataFrame({'x': {0: -17.7, 1: -15.0, 2: -12.5, 3: -12.4, 4: -12.1, 5: -11.2, 6: -8.9, 7: -7.5, 8: -7.5, 9: -6.0, 10: -6.0, 11: -4.7, 12: -4.1, 13: -3.8, 14: -3.4, 15: -3.4, 16: -1.9, 17: -1.5, 18: -1.1, 19: -0.4, 20: -0.1, 21: 3.5, 22: 3.8, 23: 5.3, 24: 5.3, 25: 5.3, 26: 5.3, 27: 5.3, 28: 5.3, 29: 5.3, 30: 5.3, 31: 5.3, 32: 6.4, 33: 6.8, 34: 6.8, 35: 10.2, 36: 10.3, 37: 11.9, 38: 12.1, 39: 14.4, 40: 15.6, 41: 19.2, 42: 19.6}, 'y': {0: -0.7854295, 1: -3820.085, 2: 2.1388333, 3: 1.7212046, 4: 2.227343, 5: 0.04315967, 6: -0.9616607, 7: -1.9878536, 8: -0.52237016, 9: -283.27216, 10: -282.5332, 11: -0.4335017, 12: -1.1585577, 13: -0.008831219, 14: 848.92303, 15: -57.407845, 16: -9.010686, 17: -3.2473037, 18: 0.5536767, 19: 1.8351307, 20: 4.8347697, 21: -6.45842, 22: -1.5683812, 23: 0.9338831, 24: 0.9338831, 25: 97.65833, 26: 1.6500127, 27: 1.6500127, 28: 97.65833, 29: 97.65833, 30: 1.6500127, 31: 97.65833, 32: -3.655422, 33: 1.9058462, 34: 227.5592, 35: 857.7455, 36: -0.68584794, 37: 1.6785516, 38: 1.6425261, 39: 2576.435, 40: 5.4869013, 41: 1.9806856, 42: 9.364718}, 'no.': {0: 'y1', 1: 'y4', 2: 'y3', 3: 'y3', 4: 'y2', 5: 'y3', 6: 'y2', 7: 'y2', 8: 'y2', 9: 'y4', 10: 'y4', 11: 'y1', 12: 'y3', 13: 'y1', 14: 'y4', 15: 'y4', 16: 'y4', 17: 'y4', 18: 'y1', 19: 'y3', 20: 'y4', 21: 'y2', 22: 'y3', 23: 'y3', 24: 'y3', 25: 'y4', 26: 'y3', 27: 'y3', 28: 'y4', 29: 'y3', 30: 'y4', 31: 'y4', 32: 'y2', 33: 'y3', 34: 'y3', 35: 'y4', 36: 'y3', 37: 'y3', 38: 'y3', 39: 'y4', 40: 'y2', 41: 'y3', 42: 'y2'}})

df2 = pd.DataFrame({'delta y': {0: 0.05388353000000001, 1: 0.08500000000003638, 2: 0.14323679999999994, 3: 0.25117999999999996, 4: 0.12734299999999976, 5: 0.36285006000000003, 6: 0.13833930000000005, 7: 0.5121464, 8: 1.97762984, 9: 0.2721599999999853, 10: 0.4667999999999779, 11: 0.2692114, 12: 0.00890970000000002, 13: 0.314458351, 14: 906.34703, 15: 0.0161549999999977, 16: 0.06831400000000087, 17: 0.3723036999999998, 18: 0.2988478, 19: 0.006991300000000145, 20: 0.14423030000000026, 21: 0.04157999999999973, 22: 0.013554200000000183, 23: 0.17486560000000007, 24: 0.17486560000000007, 25: 0.03866999999999621, 26: 0.541264, 27: 0.541264, 28: 0.03866999999999621, 29: 96.5495813, 30: 96.0469873, 31: 0.03866999999999621, 32: 0.05542200000000008, 33: 0.1670513, 34: 225.82040510000002, 35: 0.38250000000005, 36: 0.59580486, 37: 0.10641100000000003, 38: 0.14388610000000002, 39: 0.17099999999982174, 40: 0.11309869999999922, 41: 0.10224489999999986, 42: 0.23528199999999977}, 'x': {0: -17.7, 1: -15.0, 2: -12.5, 3: -12.4, 4: -12.1, 5: -11.2, 6: -8.9, 7: -7.5, 8: -7.5, 9: -6.0, 10: -6.0, 11: -4.7, 12: -4.1, 13: -3.8, 14: -3.4, 15: -3.4, 16: -1.9, 17: -1.5, 18: -1.1, 19: -0.4, 20: -0.1, 21: 3.5, 22: 3.8, 23: 5.3, 24: 5.3, 25: 5.3, 26: 5.3, 27: 5.3, 28: 5.3, 29: 5.3, 30: 5.3, 31: 5.3, 32: 6.4, 33: 6.8, 34: 6.8, 35: 10.2, 36: 10.3, 37: 11.9, 38: 12.1, 39: 14.4, 40: 15.6, 41: 19.2, 42: 19.6}})

final = df1.merge(df2, on="x")

Works for me. Can you post a running script that demonstrates the problem? You shouldn't need 47 rows, just a few should do. It would also be useful to do `df1['x'] - df2['x']` on your existing data to make sure they are equal. — tdelaney, Nov 19 '22 at 23:50
I did `df1['x'] - df2['x']` and all values are 0.0. Can you explain me what you mean by "posting a running script?" I'm kinda new to programming and to this website — Peter M, Nov 19 '22 at 23:55
I cleaned up your your examples by getting rid of the .... row, initialized the dataframes and then did the merge. It worked for me. But that was a bit of work. You could trim down df1 and df2 to something smaller that still fails. To make a working example, you could do `df1.to_dict()` then paste that result into a test script where you do `df1 = pd.DataFrame( ... paste dict here ...)`. Then the same with `df2`. Then add the merge to the script. Now it runs, demonstrates the problem, and we can grab the result ourselves to test. — tdelaney, Nov 20 '22 at 00:00
@tdelaney I even used 20 decimals and in all values I get 0.00000000000000000000 — Peter M, Nov 20 '22 at 00:00
You could do ` (df1['x'] - df2['x']).any()` but it sounds like they are the same. Not sure what the problem is. — tdelaney, Nov 20 '22 at 00:02
Can you also post `df1.dtypes` and `df2.dtypes`? See if there is any difference. — tdelaney, Nov 20 '22 at 00:03
@tdelaney yeah it is impossible that the numbers are different beyond that decimal, give me a moment to see if I can extract some info — Peter M, Nov 20 '22 at 00:04
Yes. The problem is that x values aren't unique. If the 2 x columns are identical and you haven't done anything to change the indexes of df1 and df2, you could just add the wanted column `df1["delta y"] = df2["delta y"]`. — tdelaney, Nov 20 '22 at 01:41

score 1 · Answer 1 · answered Nov 19 '22 at 23:20

1

try the following: df1.join(df2)

join is a column-wise left join

pd.merge is a column-wise inner join

pd.concat is a row-wise outer join

pd.concat: takes Iterable arguments. Thus, it cannot take DataFrames directly (use [df,df2]) Dimensions of DataFrame should match along axis

Join and pd.merge: can take DataFrame arguments

ref: Merge two dataframes by index

answered Nov 19 '22 at 23:20

Khaned

441
2
14

1

Gives me this "raise ValueError(f"columns overlap but no suffix specified: {to_rename}") ValueError: columns overlap but no suffix specified: Index(['x'], dtype='object') " – Peter M Nov 19 '22 at 23:28
try df1.join(df2, on='x', how='left', lsuffix='_left', rsuffix='_right') or try pd.concat([df1, df2], axis=1) – Khaned Nov 19 '22 at 23:31
1

The last one (concat) worked but now i have duplicate `x `columns. One as the 1st column and the other as the last. It's no problem, i can just remove one, just wondering why it happened – Peter M Nov 19 '22 at 23:51
add inner join as an argument in function – Khaned Nov 19 '22 at 23:59
Can you please explain, I'm new in Python. I did this `final = pd.concat([df1, df2], how="inner", axis=1)`but It is obviously wrong – Peter M Nov 20 '22 at 01:09
check the documentation for function arguments, "how" is not an argument in pd.concat function link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html – Khaned Nov 20 '22 at 01:17
I tried `final = pd.concat([df1, df2], axis=1, join="inner")`. Same result – Peter M Nov 20 '22 at 01:28

score 1 · Answer 2 · answered Nov 19 '22 at 23:38

1

Try the following syntax and I encourage you to thoroughly read the official documentation of python, the link is given at the bottom. I think you might have different x values in df1 and df2 and they are not 100% identical. This could be perhaps because of the decimals.

import pandas as pd

left = pd.DataFrame(
   {
       "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
     }
 )


right = pd.DataFrame(
     {
        "key": ["K0", "K1", "K2", "K3"],
         "C": ["C0", "C1", "C2", "C3"],
         "D": ["D0", "D1", "D2", "D3"],
     }
 )
 

result = pd.merge(left, right, on="key")

Result Image

Python Merge,Join, Concatenate Official Guide

answered Nov 19 '22 at 23:38

Montreal

15
8

Hi. Thank you! I will read it when I'm done with this project (deadline soon). But how come I have different x values when this column was untouched the whole program? – Peter M Nov 19 '22 at 23:49
@PeterM you could do `df1['x'] - df2['x']` to see if any are non-zero. We don't see your code so don't know whether these _should_ be the same or not. But there can be subtle differences, for instance one `float16` and the other `float64`. – tdelaney Nov 19 '22 at 23:54

score 1 · Accepted Answer · answered Nov 20 '22 at 02:40

The problem is that x values are not unique, so the merge duplicates rows to get all of the combinations. In a simple example

>>> import pandas as pd
>>> df1=pd.DataFrame({"a":[1,2,3,2], "b":['a', 'b', 'c', 'd']})
>>> df2=pd.DataFrame({"a":[1,2,3,2], "c":['aa', 'bb', 'cc', 'dd']})
>>> df1.merge(df2, on='a')
   a  b   c
0  1  a  aa
1  2  b  bb
2  2  b  dd
3  2  d  bb
4  2  d  dd
5  3  c  cc

2 is not unique in the column and gets all of the combinations (notice b --> dd and d --> dd).

In your case, the x column is identical in the two dataframes. This would also mean that indexes haven't changed and you could assign the columns you want to df1.

df1["delta y"] = df2["delta y"]

Why merging 2 data frames gives me one with triple the rows

3 Answers3