6

Here are how the dataframes columns look like.

df1='device number', 'date', ....<<10 other columns>> 3500 records

df2='device number', 'date', ....<<9 other columns>> 14,000 records

In each data frame, neither 'device number', nor 'date' are unique. However, their combination is unique to identify a row.

I am trying to form a new data frame which matches the rows from df1 and df2 where both device number and date are equal, and have all the columns from these df1 and df2. The pandas command I am trying is

df3=pd.merge(df1, df2, how='inner', on=['device number', 'date'])

However, df3 gives me a dataframe of shape (14,000, 21). The column number makes sense, but how can the inner join has more rows than any of the left dataframes? Does it mean I have a flaw in my understanding of inner join? Also, how can I achieve the result I described?

sathya
  • 1,982
  • 1
  • 20
  • 37
Della
  • 1,264
  • 2
  • 15
  • 32
  • It would be nice if you can post a code that verify your claims -- e.g. combinations are unique. If both data frames contain unique keys, the result would be (3,500, 21). – Sanghack Lee Jul 23 '17 at 06:41
  • @SanghackLee Thanks, I am pretty green in python. The uniqueness comes from my knowledge of the data source and domain knowledge. But any idea how to verify it from the code? I thought of the DataFrame.duplicated function, but not sure how to make it work on the combination of two fields. – Della Jul 23 '17 at 06:45
  • Maybe ... `print(len(set(zip(df1['device number'], df1['date']))))` and for df2. (Other methods can be found in [here](https://stackoverflow.com/questions/26977076/pandas-unique-values-multiple-columns) – Sanghack Lee Jul 23 '17 at 06:47

1 Answers1

3

Only way I can see this happening... particularly with the 14,000 being the same exact number as the number of records in df2 is if the column combination in df2 are not unique.

You can verify that they are not unique with the following (True if unique)

df2.duplicated(['device number', 'date']).sum() == 0

Or

df.set_index(['device number', 'date']).index.is_unique
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thanks you sooooo much, this had me banging my head against the wall for 2 days, this answer is perfect! – Contango Oct 16 '20 at 17:28