0

I had run the below code :

df1 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

print(pd.merge(df1,df3, on='HPI'))

I am getting the output as :

    HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9

My Question here is

1) Why I am having so big dataframe. HPI has only 4 values but in output 6 rows has been generated.

2) If merge will take all the values from HPI then why the value 80 and 88 hasn't been taken twice each?

Dibakar Bose
  • 100
  • 1
  • 8
  • 1
    Possible duplicate of [Pandas Left Outer Join results in table larger than left table](https://stackoverflow.com/questions/22720739/pandas-left-outer-join-results-in-table-larger-than-left-table) – jpp Feb 22 '18 at 09:06
  • I highly recommend you read through the [pandas documentation on merges](https://pandas.pydata.org/pandas-docs/stable/merging.html). Your question is more about understanding merges generally, and the tutorial there has examples which demonstrate what you can expect. – jpp Feb 22 '18 at 09:10
  • @jpp :: Sure thing... I will do that... thanks :) – Dibakar Bose Feb 23 '18 at 04:52

2 Answers2

1

You get 85 4 times, because duplicated in df1 and df2 in joined columns HPI . And 88 with 80 are unique, so inner join return alo one row for each.

Apparently, the inner join means that if there is a match on the join column in both tables, every row will be matched the maximum number of times possible.

So before merge need remove duplicates for correct output.

df1 = df1.drop_duplicates('HPI')
df3 = df3.drop_duplicates('HPI')

Samples with dupes values in HPI columns and outputs:

#2dupes 85
df1 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#2dupes 85
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#4dupes 85 - 2x2, value 85 in both columns
print(pd.merge(df1,df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9

#2 dupes 80, 2dupes 85
df1 = pd.DataFrame({'HPI':[80,85,80,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#2dupes 85 , unique 80       
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#4dupes 80, 2x1, 4dupes 85 - 2x2, values 80,85 in both columns
print(pd.merge(df1,df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   80         2                65            50             7
2   85         3                55            52             8
3   85         3                55            53             6
4   85         2                55            52             8
5   85         2                55            53             6

#2dupes 80
df1 = pd.DataFrame({'HPI':[80,80,82,83],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#2 dupes 85
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#2dupes 80, 2x1value 80 in both columns
print(pd.merge(df1,df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   80         3                55            50             7

#4dupes 80
df1 = pd.DataFrame({'HPI':[80,80,80,80],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#3 dupes 80
df3 = pd.DataFrame({'HPI':[80,80,80,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#12dupes 80, 4x3, value 80 in both columns
print(pd.merge(df1,df3, on='HPI'))
    HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0    80         2                50            50             7
1    80         2                50            52             8
2    80         2                50            50             9
3    80         3                55            50             7
4    80         3                55            52             8
5    80         3                55            50             9
6    80         2                65            50             7
7    80         2                65            52             8
8    80         2                65            50             9
9    80         2                55            50             7
10   80         2                55            52             8
11   80         2                55            50             9
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

As jezrael wrote, you have 6 rows because the values for HPI=85 in df1 and df3 are not unique. On the contrary on df1 and df3 you have only a value for HPI=80 and for HPI=88. If I make an assumption and consider also your index, I can guess that what you want is something like this:

       HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
index                                                             
2001    80         2                50            50             7
2002    85         3                55            52             8
2003    88         2                65            50             9
2004    85         2                55            53             6

If you want something like this, then you can do:

pd.merge(df1, df3, left_index=True, right_index=True, on='HPI')

But I am just making an assumption, so I dont know if this is the output you would like.

Joe
  • 12,057
  • 5
  • 39
  • 55