1

I would like to compare two unsorted dataframes and report the differences by column and row ID coordinates and values.

The code, first compares the csv's and if they are not equal it compares based on a merge, and if they still are not equal I know that data is different in some way.

At this point i'm not sure how to identify the column and row coordinates along with the value for each dataframe where the data is identified to be different.

Here are the dataframes:

DATAFRAME 1 - EXP:
enter image description here

DATAFRAME 2 - ACT:
enter image description here

Here is my current code:

import pandas as pd
file1 = "c:\\exp.csv"
file2 = "c:\\act.csv"
exp = pd.read_csv(file1, encoding="ANSI")
act = pd.read_csv(file2, encoding="ANSI")
exp = exp.drop_duplicates(subset=None, keep='first', inplace=False)
act = act.drop_duplicates(subset=None, keep='first', inplace=False)
result = exp.equals(act)
if result:
    print("CSV's Match")
else:
    act = act.set_index('accounts')
    dataMerged = exp.merge(act, how='left')
    dataMergedAndSorted = dataMerged.sort_values(['accounts']).set_index('account numbers')
    actSorted = act.sort_values(['accounts'])
    if dataMergedAndSorted.equals(actSorted):          
         print("The Merged, Sorted, and Compared Data Now Returns True : PASS")
    else:
         dataMergedAndSorted = dataMergedAndSorted.reset_index() 
         actSorted = actSorted.reset_index()

Known Differences by Observation of the Data Frames & How to Report it:

Exp: Col=1,Row=4,Val=9101, Act: Col=1,Row=3,Val=FOO

Exp: Col=3,Row=6,Val=BAR, Act: Col=3,Row=5,Val=malfoy

Community
  • 1
  • 1
bitShredder
  • 132
  • 1
  • 11
  • Is this helpful? https://stackoverflow.com/questions/18180763/set-difference-for-pandas – Juan Carlos Ramirez Feb 27 '19 at 23:43
  • Juan Carlos Ramierez: Thank ya for the response! I tried a few of those solutions, and it seems to pull the same data in both of my data frames and not the actual differences – bitShredder Feb 28 '19 at 00:12
  • To compute the rows that are in Exp, but not in Act, you can merge them with indicator=True (so that you know whether they came from Exp or act) and then drop those that came from Act. See https://stackoverflow.com/questions/42425549/computing-set-difference-in-pandas-between-two-dataframes?rq=1 then you can do the same to see those that are in Act but not in Exp – Juan Carlos Ramirez Feb 28 '19 at 00:39

1 Answers1

1

The easiest way I have found to identify the differences between two unordered dataframes is as follows:

First you order your DataFrames df1 and df2 by all columns, and then you can use pandas.compare to get the exact columns and rows that contain differences between them:

import pandas as pd

df1_comp = df1.sort_values(by = ['column_1', 'column_2', 'column_3']).reset_index(drop = True)
df2_comp = df2.sort_values(by = ['column_1', 'column_2', 'column_3']).reset_index(drop = True)

df_diff = df1_comp.compare(df2_comp)

This will return a dataframe (df_diff) containing only the columns and rows with differences between df1 and df2.

Rafaela V
  • 157
  • 1
  • 1
  • 8