Pandas Concat dataframes with Duplicates

Question

I am facing issues in concatenating two Dataframes of different lengths. Below is the issue:

    df1 = 
emp_id emp_name counts
1      sam       0
2      joe       0
3      john      0
    
df2 =
emp_id emp_name counts
1      sam       0
2      joe       0
2      joe       1
3      john      0

My Expected output is: Please Note that my expectation is not to merge the 2 dataframes into one but I would like to concat two dataframes side by side and highlight the differences in such a way that, if there is a duplicate row in one df, example df2, the respective row of df1 should show as NaN/blank/None any Null kind of values

Expected_output_df = 
df1                      df2    
empId   emp_name counts  emp_id   emp_name  counts
1       sam       0      1        sam       0
2       joe       0      2        joe       0
NaN     NaN       NaN    2        joe       1
3       john      0      3        john      0

whereas am getting output as below:

actual_output_df = pd.concat([df1, df2], axis='columns', keys=['df1','df2'])

the above code gives me below mentioned Dataframe. but how can I get the dataframe which is mentioned in the Expected output

actual_output_df = 
df1                      df2    
empId   emp_name counts  emp_id   emp_name  counts
1       sam       0      1        sam       0
2       joe       0      2        joe       0
3       john      0      2        joe       1
NaN     NaN       NaN    3        john      0

Tried pd.concat by passing different parameters but not getting expected result. The main issue I have in concat is, I am not able to move the duplicate rows one row down.

Can anyone please help me on this? Thanks in advance

In the above mentioned Problem, the 'counts' column is not necessary for me. I can drop it if required. — Bhavyashree717, Nov 16 '21 at 10:39
https://stackoverflow.com/questions/17095101/compare-two-dataframes-and-output-their-differences-side-by-side This is very similar to what I am looking for except there are no duplicate rows in this example where as I have duplicate rows in my example. — Bhavyashree717, Nov 16 '21 at 11:27

Simone · Accepted Answer · 2021-11-17T08:56:55.160

This does not give the exact output you asked, but it could solve your problem anyway:

df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)

Output:

    emp_id  emp_name    counts  _merge
0   1       sam         0       both
1   2       joe         0       both
2   3       john        0       both
3   2       joe         1       right_only

You don't have rows with NaNs as you wanted, but in this way you can check whether a row is in the left df, right df or both by looking at the _merge column. You can also give a custom name to that columns using indicator='name'.

Update

To get the exact output you want you can do the following:

output_df = df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)

output_df[['emp_id2', 'emp_name2', 'counts2']] = output_df[['emp_id', 'emp_name', 'counts']]

output_df.loc[output_df._merge == 'right_only', ['emp_id', 'emp_name', 'counts']] = np.nan
output_df.loc[output_df._merge == 'left_only', ['emp_id2', 'emp_name2', 'counts2']] = np.nan
output_df = output_df.drop('_merge', axis=1)

output_df.columns = pd.MultiIndex.from_tuples([('df1', 'emp_id'), ('df1', 'emp_name'), ('df1', 'counts'), 
                     ('df2', 'emp_id'), ('df2', 'emp_name'), ('df2', 'counts')])

Output:

    df1                         df2
    emp_id  emp_name    counts  emp_id  emp_name    counts
0   1.0     sam         0.0     1.0     sam         0.0
1   2.0     joe         0.0     2.0     joe         0.0
2   3.0     john        0.0     3.0     john        0.0
3   NaN     NaN         NaN     2.0     joe         1.0

Pandas Concat dataframes with Duplicates

1 Answers1

Linked