I have two DataFrames:
df1 = {'MS': [1000, 1005, 1007, NaN, 1010, 1012, 1020],
'Command': ['RD', 'RD', 'WR', '---', 'RD', 'RD', 'WR'],
'Data1': [100, 110, 120, NaN, 130, 140, 150],
'Data2': ['A', 'A', 'B', '--', 'A', 'B', 'B'],
'Data3': [1, 0, 0, NaN, 1, 1, 0]}
df2 = {'MS': [1001, 1006, 1010, NaN, 1003, 1015, 1020, 1030],
'Command': ['WR', 'RD', 'WR', '---', 'RD', 'RD', 'WR', 'RD'],
'Data1': [120, 110, 120, NaN, 140, 130, 150, 110],
'Data2': ['B', 'A', 'B', '--', 'B', 'A', 'B', 'A'],
'Data3': [0, 0, 1, NaN, 1, 0, 0, 0]}
I want to compare every row in 'df1
' with 'df2
', except 'MS
' column, where 'MS
' is time in milliseconds. Both the DFs have identical columns. Column 'MS' might contain NaN, which case need to be ignored.
By comparing, I want to print
- Matching rows in '
df1
' and 'df2
', one below the other, with a new column 'Diff
' having 'MS
' difference between the values; from above example, row 3 in 'df1
' is matching with row 1 of 'df2
', so print,
MS Diff Command Data1 Data2 Data3
0 1007 NaN WR 120 B 0
1 1001 6 WR 120 B 0
- Print all unmatched rows in
df1
anddf2
- Compare function should be generic enough to accept an argument with columns of choice and compare only those values in columns to consider match or no-match. For example, every iteration I may pass different column lists,
itr1_comp_col = ['Command', 'Data1', 'Data3']
itr2_comp_col = ['Command', 'Data2', 'Data3']
For respective iterations, it shall compare only those column values of user choice.
So far I am not able to produce any satisfactory code. I am a beginner to Pandas and
I have tried grouping them by 'Command
' column and concatenating two identical groups by dropping duplicates, as discussed in this thread.
I have manually looped through values in every row and compared, which is absolutely inefficient, as data is very huge, some million entries.
Please suggest an efficient way to handle above case. Thanks in advance.