0

Comparing each date and returning the difference

Each date represents a set of values that comes from different sources. In theory, it should be equal. I need to check if they really are.

import pandas as pd

df1 = pd.DataFrame({'Stock01': ['PETR3', 'PETR3', 'PETR3', 'PETR3'],
                    'Stock02': ['ENAT3', 'ENAT3', 'ENAT3', 'BBAS3'],
                    'Stock03': ['ITUB3', 'ITUB3', 'ITUB3', 'ITUB3'],
                    'Stock04': ['NaN', 'NaN', 'EBRN3', 'NaN'],},
                  index= ['2022-05-10', '2022-05-11', '2022-05-12', '2022-05-13'])

df2 = pd.DataFrame({'Stock01': ['PETR3', 'PETR3', 'PETR3', 'PETR3'],
                    'Stock02': ['ENAT3', 'XPTO5', 'ENAT3', 'ENAT3'],
                    'Stock03': ['ITUB3', 'ITUB3', 'ITUB3', 'ITUB3'],
                    },
                  index= ['2022-05-10', '2022-05-11', '2022-05-12', '2022-05-13'])

The output I am looking for is:

[![Difference = pd.DataFrame({'df1 & df2': \['NaN', 'ENAT3, XPTO3', 'ENBR3','ENAT3, BBAS3'\]},
                              index= \['2022-05-10', '2022-05-11', '2022-05-12', '2022-05-13'\])][1]][1]
wjandrea
  • 28,235
  • 9
  • 60
  • 81
  • Do you need to check if they are equal or not as you say in the explanation of your problem, or do you need to fin the differences (in what they differ) as you say in the title? – Ignatius Reilly Jun 04 '22 at 00:47
  • I get a syntax error when running your output – Ignatius Reilly Jun 04 '22 at 00:48
  • Does this answer your question? [Compare two dataframes with same index using one column](https://stackoverflow.com/questions/52213804/compare-two-dataframes-with-same-index-using-one-column) – Ignatius Reilly Jun 04 '22 at 00:52
  • `'XPTO3'` and `'ENBR3'` don't occur in the input. Did you mean `'XPTO5'` and `'EBRN3'`? – wjandrea Jun 04 '22 at 00:52
  • Difference = pd.DataFrame({'df1 & df2': ['NaN', 'ENAT3, XPTO3', 'ENBR3','ENAT3, BBAS3']}, index= ['2022-05-10', '2022-05-11', '2022-05-12', '2022-05-13']) – Luiz Alegro Jun 04 '22 at 01:06
  • `Difference = pd.DataFrame({'df1 & df2': ['NaN', 'ENAT3, XPTO3', 'ENBR3','ENAT3, BBAS3']}, index= ['2022-05-10', '2022-05-11', '2022-05-12', '2022-05-13'])` – Luiz Alegro Jun 04 '22 at 01:08
  • I need to check if there are differences between them. Try to use the output code above. Sorry about that. @IgnatiusReilly – Luiz Alegro Jun 04 '22 at 01:09
  • You are correct @wjandrea – Luiz Alegro Jun 04 '22 at 01:11
  • Unfortunately, @IgnatiusReilly that answer doesn´t solve my problem because it just compares exact information. – Luiz Alegro Jun 04 '22 at 01:12

1 Answers1

0

pandas compare() is usually the way to compare DataFrames, assuming they have the same columns. Otherwise, you know in advance there're differences.

Because you seem to want to capture the possibility of different columns also, we can first create dummy columns in each table, so the compare function can be applied:

df1_unique_columns = df1.columns.difference(df2.columns)
df2_unique_columns = df2.columns.difference(df1.columns)

df1_with_dummies = df1.copy()
df2_with_dummies = df2.copy()

if len(df1_unique_columns) > 0:
    for col in df1_unique_columns:
        df2_with_dummies[col] = 'NaN'

if len(df2_unique_columns) > 0:
    for col in df2_unique_columns:
        df1_with_dummies[col] = 'NaN'

df2_with_dummies = df2_with_dummies[df1_with_dummies.columns]

comparison = df1_with_dummies.compare(df2_with_dummies)

comparison
('index', '') ('Stock02', 'self') ('Stock02', 'other') ('Stock04', 'self') ('Stock04', 'other')
2022-05-11 ENAT3 XPTO5 NaN NaN
2022-05-12 NaN NaN EBRN3 NaN
2022-05-13 BBAS3 ENAT3 NaN NaN

This can be quite useful to operate later with the differences (it state for example in which column they occur.

For the particular output you were looking for:

columns_with_differences = set([comparison.columns[i][0] for i in range(len(comparison.columns))])
columns_with_differences


def concat_diff(row):
    row['df1 & df2'] = ', '.join([str(row[col][so]) for so in ['self','other'] for col in columns_with_differences if (type(row[col][so]) == str and row[col][so] is not 'NaN')])
    return row

difference = pd.DataFrame(index=comparison.index)

difference['df1 & df2'] = comparison.apply(concat_diff, axis='columns')['df1 & df2']

difference
index df1 & df2
2022-05-11 ENAT3, XPTO5
2022-05-12 EBRN3
2022-05-13 BBAS3, ENAT3

Note it only shows differences, as you stated in the description of the problem. So its not showing the first line with 'NaN' to indicate that there're no differences. You can use the index of one of the original tables to fill all the cases were not differences were found.

Ignatius Reilly
  • 1,594
  • 2
  • 6
  • 15