1

I am facing issues in concatenating two Dataframes of different lengths. Below is the issue:

    df1 = 
emp_id emp_name counts
1      sam       0
2      joe       0
3      john      0
    
df2 =
emp_id emp_name counts
1      sam       0
2      joe       0
2      joe       1
3      john      0

My Expected output is: Please Note that my expectation is not to merge the 2 dataframes into one but I would like to concat two dataframes side by side and highlight the differences in such a way that, if there is a duplicate row in one df, example df2, the respective row of df1 should show as NaN/blank/None any Null kind of values

Expected_output_df = 
df1                      df2    
empId   emp_name counts  emp_id   emp_name  counts
1       sam       0      1        sam       0
2       joe       0      2        joe       0
NaN     NaN       NaN    2        joe       1
3       john      0      3        john      0

whereas am getting output as below:

actual_output_df = pd.concat([df1, df2], axis='columns', keys=['df1','df2'])

the above code gives me below mentioned Dataframe. but how can I get the dataframe which is mentioned in the Expected output

actual_output_df = 
df1                      df2    
empId   emp_name counts  emp_id   emp_name  counts
1       sam       0      1        sam       0
2       joe       0      2        joe       0
3       john      0      2        joe       1
NaN     NaN       NaN    3        john      0

Tried pd.concat by passing different parameters but not getting expected result. The main issue I have in concat is, I am not able to move the duplicate rows one row down.

Can anyone please help me on this? Thanks in advance

  • In the above mentioned Problem, the 'counts' column is not necessary for me. I can drop it if required. – Bhavyashree717 Nov 16 '21 at 10:39
  • https://stackoverflow.com/questions/17095101/compare-two-dataframes-and-output-their-differences-side-by-side This is very similar to what I am looking for except there are no duplicate rows in this example where as I have duplicate rows in my example. – Bhavyashree717 Nov 16 '21 at 11:27

1 Answers1

1

This does not give the exact output you asked, but it could solve your problem anyway:

df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)

Output:

    emp_id  emp_name    counts  _merge
0   1       sam         0       both
1   2       joe         0       both
2   3       john        0       both
3   2       joe         1       right_only

You don't have rows with NaNs as you wanted, but in this way you can check whether a row is in the left df, right df or both by looking at the _merge column. You can also give a custom name to that columns using indicator='name'.


Update

To get the exact output you want you can do the following:

output_df = df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)

output_df[['emp_id2', 'emp_name2', 'counts2']] = output_df[['emp_id', 'emp_name', 'counts']]

output_df.loc[output_df._merge == 'right_only', ['emp_id', 'emp_name', 'counts']] = np.nan
output_df.loc[output_df._merge == 'left_only', ['emp_id2', 'emp_name2', 'counts2']] = np.nan
output_df = output_df.drop('_merge', axis=1)

output_df.columns = pd.MultiIndex.from_tuples([('df1', 'emp_id'), ('df1', 'emp_name'), ('df1', 'counts'), 
                     ('df2', 'emp_id'), ('df2', 'emp_name'), ('df2', 'counts')])

Output:

    df1                         df2
    emp_id  emp_name    counts  emp_id  emp_name    counts
0   1.0     sam         0.0     1.0     sam         0.0
1   2.0     joe         0.0     2.0     joe         0.0
2   3.0     john        0.0     3.0     john        0.0
3   NaN     NaN         NaN     2.0     joe         1.0
Simone
  • 695
  • 4
  • 6