Merged dataframe seems missing two rows

Question

I had run the below code :

df1 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

print(pd.merge(df1,df3, on='HPI'))

I am getting the output as :

    HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9

My Question here is

1) Why I am having so big dataframe. HPI has only 4 values but in output 6 rows has been generated.

2) If merge will take all the values from HPI then why the value 80 and 88 hasn't been taken twice each?

Possible duplicate of [Pandas Left Outer Join results in table larger than left table](https://stackoverflow.com/questions/22720739/pandas-left-outer-join-results-in-table-larger-than-left-table) — jpp, Feb 22 '18 at 09:06
I highly recommend you read through the [pandas documentation on merges](https://pandas.pydata.org/pandas-docs/stable/merging.html). Your question is more about understanding merges generally, and the tutorial there has examples which demonstrate what you can expect. — jpp, Feb 22 '18 at 09:10

jezrael · Accepted Answer · 2018-02-22T09:02:33.840

You get 85 4 times, because duplicated in df1 and df2 in joined columns HPI . And 88 with 80 are unique, so inner join return alo one row for each.

Apparently, the inner join means that if there is a match on the join column in both tables, every row will be matched the maximum number of times possible.

So before merge need remove duplicates for correct output.

df1 = df1.drop_duplicates('HPI')
df3 = df3.drop_duplicates('HPI')

Samples with dupes values in HPI columns and outputs:

#2dupes 85
df1 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#2dupes 85
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#4dupes 85 - 2x2, value 85 in both columns
print(pd.merge(df1,df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9

#2 dupes 80, 2dupes 85
df1 = pd.DataFrame({'HPI':[80,85,80,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#2dupes 85 , unique 80       
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#4dupes 80, 2x1, 4dupes 85 - 2x2, values 80,85 in both columns
print(pd.merge(df1,df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   80         2                65            50             7
2   85         3                55            52             8
3   85         3                55            53             6
4   85         2                55            52             8
5   85         2                55            53             6

#2dupes 80
df1 = pd.DataFrame({'HPI':[80,80,82,83],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#2 dupes 85
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#2dupes 80, 2x1value 80 in both columns
print(pd.merge(df1,df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   80         3                55            50             7

#4dupes 80
df1 = pd.DataFrame({'HPI':[80,80,80,80],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#3 dupes 80
df3 = pd.DataFrame({'HPI':[80,80,80,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#12dupes 80, 4x3, value 80 in both columns
print(pd.merge(df1,df3, on='HPI'))
    HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0    80         2                50            50             7
1    80         2                50            52             8
2    80         2                50            50             9
3    80         3                55            50             7
4    80         3                55            52             8
5    80         3                55            50             9
6    80         2                65            50             7
7    80         2                65            52             8
8    80         2                65            50             9
9    80         2                55            50             7
10   80         2                55            52             8
11   80         2                55            50             9

Downvoter, if there's something wrong with my answer, please let me know, so I can correct it. Thanks. — jezrael, Feb 22 '18 at 06:29
Ok.. But 88 and 80 are present in both dataframes. Then why it's not repeated?? — Dibakar Bose, Feb 22 '18 at 06:44
@DibakarBose - But values are not suplicated, there is difference. — jezrael, Feb 22 '18 at 06:44
@DibakarBose - I add more samples, please check it if it is more clear. — jezrael, Feb 22 '18 at 09:02

Joe · Answer 2 · 2018-02-22T07:56:39.903

As jezrael wrote, you have 6 rows because the values for HPI=85 in df1 and df3 are not unique. On the contrary on df1 and df3 you have only a value for HPI=80 and for HPI=88. If I make an assumption and consider also your index, I can guess that what you want is something like this:

       HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
index                                                             
2001    80         2                50            50             7
2002    85         3                55            52             8
2003    88         2                65            50             9
2004    85         2                55            53             6

If you want something like this, then you can do:

pd.merge(df1, df3, left_index=True, right_index=True, on='HPI')

But I am just making an assumption, so I dont know if this is the output you would like.

Yes... That's what the output I was looking for... :) – Dibakar Bose Feb 23 '18 at 08:30 — Dibakar Bose, Feb 23 '18 at 08:30

Merged dataframe seems missing two rows

2 Answers2