Comparing 2 rows of different dataframes to find the differences

Question

Comparing each date and returning the difference

Each date represents a set of values that comes from different sources. In theory, it should be equal. I need to check if they really are.

import pandas as pd

df1 = pd.DataFrame({'Stock01': ['PETR3', 'PETR3', 'PETR3', 'PETR3'],
                    'Stock02': ['ENAT3', 'ENAT3', 'ENAT3', 'BBAS3'],
                    'Stock03': ['ITUB3', 'ITUB3', 'ITUB3', 'ITUB3'],
                    'Stock04': ['NaN', 'NaN', 'EBRN3', 'NaN'],},
                  index= ['2022-05-10', '2022-05-11', '2022-05-12', '2022-05-13'])

df2 = pd.DataFrame({'Stock01': ['PETR3', 'PETR3', 'PETR3', 'PETR3'],
                    'Stock02': ['ENAT3', 'XPTO5', 'ENAT3', 'ENAT3'],
                    'Stock03': ['ITUB3', 'ITUB3', 'ITUB3', 'ITUB3'],
                    },
                  index= ['2022-05-10', '2022-05-11', '2022-05-12', '2022-05-13'])

The output I am looking for is:

[![Difference = pd.DataFrame({'df1 & df2': \['NaN', 'ENAT3, XPTO3', 'ENBR3','ENAT3, BBAS3'\]},
                              index= \['2022-05-10', '2022-05-11', '2022-05-12', '2022-05-13'\])][1]][1]

Do you need to check if they are equal or not as you say in the explanation of your problem, or do you need to fin the differences (in what they differ) as you say in the title? — Ignatius Reilly, Jun 04 '22 at 00:47
Does this answer your question? [Compare two dataframes with same index using one column](https://stackoverflow.com/questions/52213804/compare-two-dataframes-with-same-index-using-one-column) — Ignatius Reilly, Jun 04 '22 at 00:52
`'XPTO3'` and `'ENBR3'` don't occur in the input. Did you mean `'XPTO5'` and `'EBRN3'`? — wjandrea, Jun 04 '22 at 00:52
Difference = pd.DataFrame({'df1 & df2': ['NaN', 'ENAT3, XPTO3', 'ENBR3','ENAT3, BBAS3']}, index= ['2022-05-10', '2022-05-11', '2022-05-12', '2022-05-13']) — Luiz Alegro, Jun 04 '22 at 01:06
`Difference = pd.DataFrame({'df1 & df2': ['NaN', 'ENAT3, XPTO3', 'ENBR3','ENAT3, BBAS3']}, index= ['2022-05-10', '2022-05-11', '2022-05-12', '2022-05-13'])` — Luiz Alegro, Jun 04 '22 at 01:08
I need to check if there are differences between them. Try to use the output code above. Sorry about that. @IgnatiusReilly — Luiz Alegro, Jun 04 '22 at 01:09
Unfortunately, @IgnatiusReilly that answer doesn´t solve my problem because it just compares exact information. — Luiz Alegro, Jun 04 '22 at 01:12

Ignatius Reilly · Answer 1 · 2022-06-04T17:48:02.593

pandas compare() is usually the way to compare DataFrames, assuming they have the same columns. Otherwise, you know in advance there're differences.

Because you seem to want to capture the possibility of different columns also, we can first create dummy columns in each table, so the compare function can be applied:

df1_unique_columns = df1.columns.difference(df2.columns)
df2_unique_columns = df2.columns.difference(df1.columns)

df1_with_dummies = df1.copy()
df2_with_dummies = df2.copy()

if len(df1_unique_columns) > 0:
    for col in df1_unique_columns:
        df2_with_dummies[col] = 'NaN'

if len(df2_unique_columns) > 0:
    for col in df2_unique_columns:
        df1_with_dummies[col] = 'NaN'

df2_with_dummies = df2_with_dummies[df1_with_dummies.columns]

comparison = df1_with_dummies.compare(df2_with_dummies)

comparison

('index', '')	('Stock02', 'self')	('Stock02', 'other')	('Stock04', 'self')	('Stock04', 'other')
2022-05-11	ENAT3	XPTO5	NaN	NaN
2022-05-12	NaN	NaN	EBRN3	NaN
2022-05-13	BBAS3	ENAT3	NaN	NaN

This can be quite useful to operate later with the differences (it state for example in which column they occur.

For the particular output you were looking for:

columns_with_differences = set([comparison.columns[i][0] for i in range(len(comparison.columns))])
columns_with_differences


def concat_diff(row):
    row['df1 & df2'] = ', '.join([str(row[col][so]) for so in ['self','other'] for col in columns_with_differences if (type(row[col][so]) == str and row[col][so] is not 'NaN')])
    return row

difference = pd.DataFrame(index=comparison.index)

difference['df1 & df2'] = comparison.apply(concat_diff, axis='columns')['df1 & df2']

difference

index	df1 & df2
2022-05-11	ENAT3, XPTO5
2022-05-12	EBRN3
2022-05-13	BBAS3, ENAT3

Note it only shows differences, as you stated in the description of the problem. So its not showing the first line with 'NaN' to indicate that there're no differences. You can use the index of one of the original tables to fill all the cases were not differences were found.

It is not ok because it is not considering all the columns in df2 @Ignatius Reilly — Luiz Alegro, Jun 04 '22 at 12:21
Not sure what wasn't working. Here's a more straightforward way of doing it. It works for differences in columns and values in both tables. — Ignatius Reilly, Jun 04 '22 at 17:49

Comparing 2 rows of different dataframes to find the differences

1 Answers1