I have two data frames with identical columns. I would like to generate a new df that contains only the data that differs between the corresponding columns of the two data frames. Like this: [image of the desired output] Note, I have done some pre-processing such that:

  • All ids in df1 exist in df2; all ids in df2 exist in df1
  • There are no NA/NaN values.
  • All cells are strings with a minimum length of 1 and a max length of 3500.
  • The names of the columns that need to be compared are stored in a list.

I'm not sure how to get this granular information. I have tried iterating over each column and generating a data frame for it, like this:

for v in col_list:
    # rows whose (id, v) pair appears in only one of the two frames
    m_df = pd.merge(df1, df2, on=['id', v], how='outer', indicator=True).query('_merge != "both"')

But, I'm not sure how to combine these data frames into a single data frame.
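One way to end up with a single frame is to collect the per-column results in a list and stack them with `pd.concat`. The sketch below is a variation on the loop above, not the same merge: it joins on `id` only and then filters rows whose values differ. The sample data and the output column names (`column`, `df1_value`, `df2_value`) are made up for illustration, and it assumes `id` is unique within each frame.

```python
import pandas as pd

# Hypothetical sample data; only the `id` column name comes from the question.
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"], "city": ["x", "y", "z"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "B", "c"], "city": ["x", "y", "Z"]})
col_list = ["name", "city"]

parts = []
for v in col_list:
    # Pair up the values of column v from both frames by id.
    m = pd.merge(df1[["id", v]], df2[["id", v]], on="id", suffixes=("_df1", "_df2"))
    # Keep only the ids where the two values differ.
    m = m[m[v + "_df1"] != m[v + "_df2"]]
    # Use the same column names for every v so the pieces can be stacked.
    m = m.rename(columns={v + "_df1": "df1_value", v + "_df2": "df2_value"})
    m.insert(1, "column", v)
    parts.append(m)

# One row per differing (id, column) pair.
diff_df = pd.concat(parts, ignore_index=True)
print(diff_df)
```

Each row of `diff_df` then names the id, the column that differs, and the value on each side.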

This solution closely addresses my problem, but I don't know how to adapt it to my needs: https://stackoverflow.com/a/47112033/7987118

Harsha

1 Answer

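Just compare the two frames element-wise with df == df2. The result is a boolean frame that is False wherever the cells differ.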

# imports

import pandas as pd

d=[['v1', 'v3', 'v56', 'sstr'],
['v2', 'v4', 'v34', 'sstr'],
['v3', 'v5', 'v12', ''],
['v4', 'v6', 'v-10', 'sstr'],
['v5', 'v7', 'v-32', 'sstr']]

d2=[['v1', 'v3', 'v56', 'sstr'],
['v234', 'v4adf', 'v34', 'sstr'],
['v3', 'v5asd', 'v12', 'sstr'],
['v4', 'v6', 'asdfasdf', 'sstr'],
['v5', 'v7', 'v-32', 'sstr']]


# datasets


df=pd.DataFrame(d,columns='a b c d'.split())
df2=pd.DataFrame(d2,columns='a b c d'.split())
print(df)
print(df2)

# element-wise comparison: False marks the cells that differ

print(df==df2)
#       a      b      c      d
#0   True   True   True   True
#1  False  False   True   True
#2   True  False   True  False
#3   True   True  False   True
#4   True   True   True   True
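To go from the boolean frame to a list of just the differing cells, the positions where the comparison fails can be looked up in both frames. This is only a sketch on a two-row cut of the data above; the output column names (`row`, `column`, `df`, `df2`) are my own choice, and it assumes both frames have the same index and column labels in the same order.

```python
import numpy as np
import pandas as pd

# Small cut of the example data above (first two rows, columns a and b).
df = pd.DataFrame({"a": ["v1", "v2"], "b": ["v3", "v4"]})
df2 = pd.DataFrame({"a": ["v1", "v234"], "b": ["v3", "v4adf"]})

# True where the two frames disagree.
mask = df != df2

# Positions (row number, column number) of every differing cell.
rows, cols = np.nonzero(mask.to_numpy())

# One output row per differing cell, with the value from each frame.
diff = pd.DataFrame({
    "row": df.index.to_numpy()[rows],
    "column": df.columns.to_numpy()[cols],
    "df": df.to_numpy()[rows, cols],
    "df2": df2.to_numpy()[rows, cols],
})
print(diff)
```

If the frames carry an `id` column, setting it as the index first makes the `row` column hold the id instead of a position.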
sanzo213
  • The column names are the same between the two dfs, and I'm getting this error: /venv_2/lib64/python3.7/site-packages/pandas/core/ops/__init__.py", line 511, in _align_method_FRAME "Can only compare identically-labeled DataFrame objects" ValueError: Can only compare identically-labeled DataFrame objects. – Harsha Aug 03 '21 at 13:11
  • @Harsha did you check the column order in the two dfs, and did you try the `.info()` method? – sanzo213 Aug 03 '21 at 13:28
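As far as I know, that ValueError is raised when the row index or the column labels of the two frames are not identical, and a different order counts as not identical. A possible workaround is to index both frames by `id`, select the same columns in the same order, and reorder one frame's rows to match the other before comparing. The data below is made up, and the sketch assumes `id` is unique and the same set of ids is present in both frames, as stated in the question.

```python
import pandas as pd

# Hypothetical data: same ids and columns, but different row and column order.
df1 = pd.DataFrame({"id": ["x", "y", "z"], "name": ["a", "b", "c"], "city": ["p", "q", "r"]})
df2 = pd.DataFrame({"city": ["r", "p", "Q"], "id": ["z", "x", "y"], "name": ["c", "a", "b"]})
col_list = ["name", "city"]

# Index by id and pick the compared columns in one fixed order.
left = df1.set_index("id")[col_list].sort_index()
# Reorder the second frame's rows to follow the first frame's index.
right = df2.set_index("id")[col_list].reindex(left.index)

# Labels are now identical, so the element-wise comparison is allowed.
mask = left != right
print(mask)
```

Here `mask` is True only for id `y` in column `city`, the one cell that actually differs.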