2

i have two dataframes df1 and df2. Same index and same column names. how to construct a dataframe which shows difference, but only rows which have at least one different cell? if row has different cells, but some are same, keep same cells intact.

example:

df1=pd.DataFrame({1:['a','a'],2:['c','c']})
df2=pd.DataFrame({1:['a','a'],2:['d','c']})

output needed:

pd.DataFrame({1:['a'],2:['c->d']},index=[0])

output in this example should be one row dataframe, not dataframe including same rows

NB: output should only contain full rows which has at least one difference in cell

i'd like an efficient solution without iterating by rows , and without creating special-strings in DataFrame

alexprice
  • 394
  • 4
  • 12

2 Answers2

1

You can use this brilliant solution:

def report_diff(x):
    return x[0] if x[0] == x[1] else '{}->{}'.format(*x)

In [70]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0)
Out[70]:
   1     2
0  a  c->d
1  a     c

For bit more complex DataFrames:

In [73]: df1
Out[73]:
   A  B  C
0  a  c  1
1  a  c  2
2  1  2  3

In [74]: df2
Out[74]:
   A  B  C
0  a  d  1
1  a  c  2
2  1  2  4

In [75]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0)
Out[75]:
   A     B     C
0  a  c->d     1
1  a     c     2
2  1     2  3->4

UPDATE: showing only changed/different rows:

In [54]: mask = df1.ne(df2).any(1)

In [55]: mask
Out[55]:
0     True
1    False
2     True
dtype: bool

In [56]: pd.Panel(dict(df1=df1[mask],df2=df2[mask])).apply(report_diff, axis=0)
Out[56]:
   A     B     C
0  a  c->d     1
2  1     2  3->4
Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
1

How about a good ole list comprehension on the flattened contents...

import pandas as pd
import numpy as np

df1=pd.DataFrame({1:['a','a'],2:['c','c']})
df2=pd.DataFrame({1:['a','a'],2:['d','c']})

rows_different_mask = (df1 != df2).any(axis=1)

pairs = zip(df1.values.reshape(1, -1)[0], df2.values.reshape(1, -1)[0])
new_elems = ["%s->%s" %(old, new) if (old != new) else new for old, new in pairs]
df3 = pd.DataFrame(np.reshape(new_elems, df1.values.shape))
print df3

   0     1
0  a  c->d
1  a     c
dataflow
  • 475
  • 2
  • 12