You get 85
4 times, because duplicated in df1
and df2
in joined columns HPI
. And 88
with 80
are unique, so inner join return alo one row for each.
Apparently, the inner join
means that if there is a match on the join column in both tables, every row will be matched the maximum number of times possible.
So before merge need remove duplicates for correct output.
df1 = df1.drop_duplicates('HPI')
df3 = df3.drop_duplicates('HPI')
Samples with dupes values in HPI
columns and outputs:
#2dupes 85
df1 = pd.DataFrame({'HPI':[80,85,88,85],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2001, 2002, 2003, 2004])
#2dupes 85
df3 = pd.DataFrame({'HPI':[80,85,88,85],
'Unemployment':[7, 8, 9, 6],
'Low_tier_HPI':[50, 52, 50, 53]},
index = [2001, 2002, 2003, 2004])
#4dupes 85 - 2x2, value 85 in both columns
print(pd.merge(df1,df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 85 3 55 52 8
2 85 3 55 53 6
3 85 2 55 52 8
4 85 2 55 53 6
5 88 2 65 50 9
#2 dupes 80, 2dupes 85
df1 = pd.DataFrame({'HPI':[80,85,80,85],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2001, 2002, 2003, 2004])
#2dupes 85 , unique 80
df3 = pd.DataFrame({'HPI':[80,85,88,85],
'Unemployment':[7, 8, 9, 6],
'Low_tier_HPI':[50, 52, 50, 53]},
index = [2001, 2002, 2003, 2004])
#4dupes 80, 2x1, 4dupes 85 - 2x2, values 80,85 in both columns
print(pd.merge(df1,df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 80 2 65 50 7
2 85 3 55 52 8
3 85 3 55 53 6
4 85 2 55 52 8
5 85 2 55 53 6
#2dupes 80
df1 = pd.DataFrame({'HPI':[80,80,82,83],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2001, 2002, 2003, 2004])
#2 dupes 85
df3 = pd.DataFrame({'HPI':[80,85,88,85],
'Unemployment':[7, 8, 9, 6],
'Low_tier_HPI':[50, 52, 50, 53]},
index = [2001, 2002, 2003, 2004])
#2dupes 80, 2x1value 80 in both columns
print(pd.merge(df1,df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 80 3 55 50 7
#4dupes 80
df1 = pd.DataFrame({'HPI':[80,80,80,80],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2001, 2002, 2003, 2004])
#3 dupes 80
df3 = pd.DataFrame({'HPI':[80,80,80,85],
'Unemployment':[7, 8, 9, 6],
'Low_tier_HPI':[50, 52, 50, 53]},
index = [2001, 2002, 2003, 2004])
#12dupes 80, 4x3, value 80 in both columns
print(pd.merge(df1,df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 80 2 50 52 8
2 80 2 50 50 9
3 80 3 55 50 7
4 80 3 55 52 8
5 80 3 55 50 9
6 80 2 65 50 7
7 80 2 65 52 8
8 80 2 65 50 9
9 80 2 55 50 7
10 80 2 55 52 8
11 80 2 55 50 9