Python pandas Dataframe comparison

Question

People of stack overflow, help!

I have a leetcode style problem for you guys.

Imagine a scenario where you have 2 2D arrays, more specifically 2 Dataframes with pandas.

I need to compare these 2 Dataframes and highlight all the differences, however there is a catch. Rows can be missing from these data frames which makes this inherently a lot more difficult, as well as missing cells too. I'll provide an example.

import pandas as pd

x = [[0, 1, 2, 3],[4, 5, 6, 7],[8, 9, 10, 11],[12, 13, 14, 15]]
y = [[nan, 1, 2, 3],[4, 5, 6, nan],[12, 13, 14, 15]]

df1 = pd.DataFrame(x)
df2 = pd.DataFrame(y)

How can I identify all of the missing cells AND the missing rows?

Bonus points if you can create code to highlight the differences and export them to an excel sheet ;)

score 0 · Answer 1 · answered Sep 09 '22 at 21:32

Stage 1
A good starting point would be the following StackOverflow question: https://stackoverflow.com/a/48647840/15965988

This would remove 100% duplicate rows from both tables.

Stage 2
At this stage, only rows with differences exist. From here I'd recommend looping over each row. For each row you'll need to create some logic that queries the other dataframe looking for a similar row. During that query consider querying with only some columns.

Best of luck.

score 0 · Accepted Answer · answered Sep 10 '22 at 00:52

example dataset

Slightly tweaking your example data, lets define the following dataframes:

import pandas as pd
import numpy as np
x = [[0, 1, 2, 3],[4, 5, 6, 7],[8, 9, 10, 11],[12, 13, 14, 15]]
y = [[4, 5, 6, 99],[8, 9, np.nan, 11],[12, 13, 14, 15]]

df_ref = pd.DataFrame(x, index=range(4), columns=["a", "b","c","d"])
df = pd.DataFrame(y, index=[1,2,5], columns=["a", "b","c","d"])

df_ref is your "reference" dataframe.

and "df" the dataframe you are comparing it to.

The differences are:

rows 0 and 3 missing
a new row (5)
(0, "d") is equal to 99 instead of 3
(2, "c") is NaN instead of 10

solution

The following solution highlights:

[in red] the "deleted rows" (row indexes that don't appear in df)
[in green] the "new rows" (row indexes that don't appear in df_ref)
[in orange] the values that differ for common rows

def get_dataframes_diff(df: pd.DataFrame, df_ref: pd.DataFrame, path_excel = None):
    rows_new = df.index[~df.index.isin(df_ref.index)]
    rows_del = df_ref.index[~df_ref.index.isin(df.index)]
    rows_common = df_ref.index.intersection(df.index)

    df_diff = pd.concat([df, df_ref.loc[rows_del]]).sort_index()
    s = df_diff.style

    def format_row(row, color: str = "white", bg_color: str = "green"):
        return [f"color: {color}; background-color: {bg_color}"] * len(row)

    s.apply(format_row, subset = (rows_new, df.columns), color="white", bg_color="green", axis=1)
    s.apply(format_row, subset = (rows_del, df.columns), color="white", bg_color="red", axis=1)

    mask = pd.DataFrame(True, index=df_diff.index, columns=df_diff.columns)
    mask.loc[rows_same] = (df_ref.loc[rows_same] == df.loc[rows_same])
    mask.replace(True, None, inplace=True)
    mask.replace(False, "color: black; background-color: orange;", inplace=True)

    s.apply(lambda _: mask, axis=None)

    if path_excel is not None:
        s.to_excel(path_excel)
    return s

It gives:

get_dataframes_diff(df, df_ref)

explanation

get the list of deleted rows, new rows and those in common

rows_new = df.index[~df.index.isin(df_ref.index)]
rows_del = df_ref.index[~df_ref.index.isin(df.index)]
rows_same = df_ref.index.intersection(df.index)

create a "diff" dataframe, by adding the deleted rows to the df dataframe

df_diff = pd.concat([df, df_ref.loc[rows_del]]).sort_index()

Use Styler.apply to highlight in green the new rows, and red the lines deleted (note the use of the subsetargument):

def format_row(row, color: str = "white", bg_color: str = "green"):
    return [f"color: {color}; background-color: {bg_color}"] * len(row)

df_diff.style.apply(format_row, subset = (rows_new, df.columns), color="white", bg_color="green", axis=1)
df_diff.style.apply(format_row, subset = (rows_del, df.columns), color="white", bg_color="red", axis=1)

To highlight value differences for common rows, create a mask dataframe which equals True for elements that are the same, and False when values differ

mask = pd.DataFrame(True, index=df_diff.index, columns=df_diff.columns)
mask.loc[rows_common] = (df_ref.loc[rows_common] == df.loc[rows_common])

When True (same value), we don't apply any styling. When False, we highlight in orange:

mask.replace(True, None, inplace=True)
mask.replace(False, "color: black; background-color: orange;", inplace=True)

df_diff.style.apply(lambda _: mask, axis=None)

Finally if you want to save it as an excel file, provide a valid path to the path_excel argument.

Python pandas Dataframe comparison

2 Answers2

example dataset

solution

explanation