I want to understand the pd.merge
work nature. I have two dataframes that have unequal length. When trying to merge them through this command
merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
The length was different from expected as follows
length of comps: 4829
length of surgical: 7939
length of merged: 9531
From my own understanding, merged
dataframe should have as same as the length of comps
dataframe since left
join will look for matching keys in both dataframes and discard the rest. As long as comps
length is less than surgical
length, the merged
length should be 4829. Why does it have 9531?? larger number than the length of both. Even if I changed the how
parameter to "right"
, merged
has a larger number than expected.
Generally, I want to know how to merge two dataframes that have unequal length specifying some columns from the right dataframe. Also, how do I validate the merge operation?. Find this might be helpful:
comps_ls: list of complications I want to throw on surgical dataframe.
mrn, Admission: the key columns I want to merge the two dataframes on.
Note: a teammate suggests this solution
merged = pd.merge(surgical, comps[comps_ls+['mrn','Admission']], on=['mrn','Admission'], how='left')
merged = surgical.join(merged, on=['mrn'], how='left', lsuffix="", rsuffix="_r")
The length of the output was as follows
length of comps: 4829
length of surgical: 7939
length of merged: 7939
How can this help?