0

I want to understand the pd.merge work nature. I have two dataframes that have unequal length. When trying to merge them through this command

merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')

The length was different from expected as follows

length of comps: 4829
length of surgical: 7939
length of merged: 9531

From my own understanding, merged dataframe should have as same as the length of comps dataframe since left join will look for matching keys in both dataframes and discard the rest. As long as comps length is less than surgical length, the merged length should be 4829. Why does it have 9531?? larger number than the length of both. Even if I changed the how parameter to "right", merged has a larger number than expected.

Generally, I want to know how to merge two dataframes that have unequal length specifying some columns from the right dataframe. Also, how do I validate the merge operation?. Find this might be helpful:

comps_ls: list of complications I want to throw on surgical dataframe.
mrn, Admission: the key columns I want to merge the two dataframes on.

Note: a teammate suggests this solution

merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
merged = surgical.join(merged, on=['mrn'], how='left', lsuffix="", rsuffix="_r")

The length of the output was as follows

length of comps: 4829
length of surgical: 7939
length of merged: 7939

How can this help?

Nemra Khalil
  • 69
  • 1
  • 6
  • 1
    It means that for a unique `('mrn', 'Admission')` , the right dataframe (`surgical`) has multiple entries, and they are all selected during the merge. – XavierBrt Jan 23 '21 at 17:16

2 Answers2

2

The "issue" is with duplicated merge keys, which can cause the resulting merge to be larger than the original. For a left merge you can expect the result to be in between N_rows_left and N_rows_left * N_rows_right rows long. The lower bound is in the case that both the left and right DataFrames have no duplicate merge keys, and the upper bound is the case when the left and right DataFrames have the single same value for the merge keys on every row.

Here's a worked example. All DataFrames are 4 rows long, but df2 has duplicate merge keys. As a result when df2 is merged to df the output is longer than df, because for the row with 2 as the key in df, both rows in df2 are matched.

import pandas as pd
df = pd.DataFrame({'key': [1,2,3,4]})
df1 = pd.DataFrame({'row': range(4), 'key': [2,3,4,5]})
df2 = pd.DataFrame({'row': range(4), 'key': [2,2,3,3]})

# Neither frame duplicated on merge key, result is same length (4) as left. 
df.merge(df1, on='key', how='left')

#   key  row
#0    1  NaN
#1    2  0.0
#2    3  1.0
#3    4  2.0


# df2 is duplicated on the merge keys so we get >4 rows 
df.merge(df2, on='key', how='left')

#   key  row
#0    1  NaN
#1    2  0.0    # Both `2` rows matched
#2    2  1.0    # ^
#3    3  2.0    # Both `3` rows matched
#4    3  3.0    # ^
#5    4  NaN
ALollz
  • 57,915
  • 7
  • 66
  • 89
2

If the length of the merged dataframe is greater than the length of the left dataframe, it means that the right dataframe has multiple entries for the same joining key. For instance if you have these dataframes:

df1
---
    id      product
0   111     car
1   222     bike
df2
---
    id      color
0   111     blue
1   222     red
3   222     green
3   333     yellow

A merge will render 3 rows, because there are two possible matches for the row of df1 with id 222.

df1.merge(df2, on="id", how="left")
---
    id      product  color
0   111     car      blue
1   222     bike     red
2   222     bike     green
XavierBrt
  • 1,179
  • 8
  • 13