Compare custom column values and print difference in pandas DataFrame

Question

I have two DataFrames:

df1 = {'MS': [1000, 1005, 1007, NaN, 1010, 1012, 1020],
      'Command': ['RD', 'RD', 'WR', '---', 'RD', 'RD', 'WR'],
      'Data1': [100, 110, 120, NaN, 130, 140, 150],
      'Data2': ['A', 'A', 'B', '--', 'A', 'B', 'B'],
      'Data3': [1, 0, 0, NaN, 1, 1, 0]}

df2 = {'MS': [1001, 1006, 1010, NaN, 1003, 1015, 1020, 1030],
      'Command': ['WR', 'RD', 'WR', '---', 'RD', 'RD', 'WR', 'RD'],
      'Data1': [120, 110, 120, NaN, 140, 130, 150, 110],
      'Data2': ['B', 'A', 'B', '--', 'B', 'A', 'B', 'A'],
      'Data3': [0, 0, 1, NaN, 1, 0, 0, 0]}

I want to compare every row in 'df1' with 'df2', except 'MS' column, where 'MS' is time in milliseconds. Both the DFs have identical columns. Column 'MS' might contain NaN, which case need to be ignored.

By comparing, I want to print

Matching rows in 'df1' and 'df2', one below the other, with a new column 'Diff' having 'MS' difference between the values; from above example, row 3 in 'df1' is matching with row 1 of 'df2', so print,

       MS      Diff      Command      Data1      Data2     Data3
    0  1007    NaN       WR           120        B         0
    1  1001    6         WR           120        B         0

Print all unmatched rows in df1 and df2
Compare function should be generic enough to accept an argument with columns of choice and compare only those values in columns to consider match or no-match. For example, every iteration I may pass different column lists,

itr1_comp_col = ['Command', 'Data1', 'Data3']
itr2_comp_col = ['Command', 'Data2', 'Data3']

For respective iterations, it shall compare only those column values of user choice.

So far I am not able to produce any satisfactory code. I am a beginner to Pandas and I have tried grouping them by 'Command' column and concatenating two identical groups by dropping duplicates, as discussed in this thread. I have manually looped through values in every row and compared, which is absolutely inefficient, as data is very huge, some million entries.

Please suggest an efficient way to handle above case. Thanks in advance.

Matching rows of two dataframes is commonly done using a dataframe merge operation. Consider looking at the results of this statement: `res = pd.merge(df1, df2, on=['Command', 'Data1', 'Data2', 'Data3'], how='inner')` which should return the matched rows and two extra columns `MS_x` and `MS_y`. You should be able to compute the difference `res['MS_x'] - res['MS_y']`. — Ankur, Jan 21 '21 at 05:11
Current code I have is `res = pd.merge(df1, df2, on=['Command', 'Data1', 'Data2', 'Data3'], how='outer')` and then finding difference among the `MS` columns, like `res['Diff'] = res['MS_x'] - res['MS_y']` — Kiran, Jan 21 '21 at 07:50

score 0 · Answer 1 · answered Jan 21 '21 at 09:58

I will answer my own question, wrt, @Ankur said in his comments:

Even though this doesn't print matching rows one below the other, however it partially fulfils the requirement.

Referring to this page, merge can be used to find difference in DFs. Especially, the argument how= will do the work. Below is the function:

def find_diff(df1: pd.DataFrame, df2: pd.DataFrame, col_itr):
    res = pd.merge(df1, df2, on=col_itr, how='outer')
    res['Diff'] = res['MS_x'] - res['MS_y']
    print (res)

Usage:

import pandas as pd
import numpy as np

d1 = {'MS': [1000, 1005, 1007, np.NaN, 1010, 1012, 1020],
      'Command': ['RD', 'RD', 'WR', '-', 'RD', 'RD', 'WR'],
      'Data1': [100, 110, 120, np.NaN, 130, 140, 150],
      'Data2': ['A', 'A', 'B', '-', 'A', 'B', 'B'],
      'Data3': [1, 0, 0, np.NaN, 1, 1, 0]}

d2 = {'MS': [1001, 1006, 1010, np.NaN, 1003, 1015, 1020, 1030],
      'Command': ['WR', 'RD', 'WR', '-', 'RD', 'RD', 'WR', 'RD'],
      'Data1': [120, 110, 120, np.NaN, 140, 130, 150, 110],
      'Data2': ['B', 'A', 'B', '-', 'B', 'A', 'B', 'A'],
      'Data3': [0, 0, 1, np.NaN, 1, 0, 0, 0]}

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

itr1_comp_col = ['Command', 'Data1', 'Data2', 'Data3']
itr2_comp_col = ['Command', 'Data2', 'Data3']

find_diff(df1, df2, itr1_comp_col)
find_diff(df1, df2, itr2_comp_col)

Compare custom column values and print difference in pandas DataFrame

1 Answers1