1

I have two dataframes, df1 and df2. shapes of df1 is (118093,2) and df2 is (690,5)

I am trying to merge the dataframes using:

df3 = df1.merge(df2, on='consId', how='left')

I have removed the duplicates in df2(as suggested here) and I know for a fact that consId in df1 is repeated(although the entire row is not duplicated).

The behaviour I am expecting is that the resulting df3 should have the same number of rows as df1 but with the columns of df2 appended to df1.

My dataframes are as follows:

df_consId[18:25]
Out[22]: 
           consId  consumption
18  1155696024483    97.203390
19  1155696024483    80.527278
20  1155696024483    78.953125
21  1155699007557   137.419355
22  1155699007557   117.145161
23  1155699007557   101.500000
24  1155699007557    85.883333

df.head()
Out[23]: 
          consId rfrg-age  rfrg-rating rfrg-type rfrg-capc
0  1155696024483    812yr          1.0     1door      150L
1  1155699007557    812yr          1.0     1door      150L
2  1155694005571     48yr          1.0     2door      250L
3  1155691016680      NaN          NaN       NaN       NaN
4  1155697016945     13yr          5.0     1door      150L

And I do df3 = df_consId.merge(df, on= 'consId', how = 'left')

Although the merge is working as I intended, there are additional rows in df3 although there are no duplicates in df. Any reason why?

Community
  • 1
  • 1
  • 1
    I see duplicates in `df_consId` so this is expected operation – EdChum Sep 06 '16 at 15:35
  • Can't repro with the data example... as long as you're sure you're using the deduped data for `df2` in the merge - it should just work... – Jon Clements Sep 06 '16 at 15:36
  • 1
    @Ninja Puppy - I did dedupe the data for `df2`. Merge works but I am left with more rows than the number of rows in df1. @EdChum - I do not see why it is an expected operation. I have removed duplicates in df2. Shouldn't duplicates in df1 not matter? At least it shouldn't increase the number of rows after merging. – Harshavardhan Ramanna Sep 06 '16 at 15:58
  • @HarshavardhanRamanna umm... double double check you are using the right dataset in the merge (I've often deduped a dataset, then used the old one instead) - but yeah, if that's deduped - you shouldn't get more rows out than in for df1 – Jon Clements Sep 06 '16 at 16:01
  • @NinjaPuppy - I have no idea why this error occurred. I believe it is some data error in df2. I read df2 from another file and it worked like a charm. – Harshavardhan Ramanna Sep 06 '16 at 17:45

1 Answers1

0

If the left table has duplicate keys where as the right table has unique keys, which is the case with the problem mentioned by you, a left join should give no more than the number of records in the left table.

However, if the left table just has the 2 columns you mentioned, it would be better to group that table by the key, before doing a join.

Narendra Gadidasu
  • 121
  • 1
  • 2
  • 8