Show complete rows highliting difference between dataframes df1 , df2 , but only when difference in row's cell exists

Question

i have two dataframes df1 and df2. Same index and same column names. how to construct a dataframe which shows difference, but only rows which have at least one different cell? if row has different cells, but some are same, keep same cells intact.

example:

df1=pd.DataFrame({1:['a','a'],2:['c','c']})
df2=pd.DataFrame({1:['a','a'],2:['d','c']})

output needed:

pd.DataFrame({1:['a'],2:['c->d']},index=[0])

output in this example should be one row dataframe, not dataframe including same rows

NB: output should only contain full rows which has at least one difference in cell

i'd like an efficient solution without iterating by rows , and without creating special-strings in DataFrame

score 1 · Accepted Answer · edited May 23 '17 at 10:29

1

You can use this brilliant solution:

def report_diff(x):
    return x[0] if x[0] == x[1] else '{}->{}'.format(*x)

In [70]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0)
Out[70]:
   1     2
0  a  c->d
1  a     c

For bit more complex DataFrames:

In [73]: df1
Out[73]:
   A  B  C
0  a  c  1
1  a  c  2
2  1  2  3

In [74]: df2
Out[74]:
   A  B  C
0  a  d  1
1  a  c  2
2  1  2  4

In [75]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0)
Out[75]:
   A     B     C
0  a  c->d     1
1  a     c     2
2  1     2  3->4

UPDATE: showing only changed/different rows:

In [54]: mask = df1.ne(df2).any(1)

In [55]: mask
Out[55]:
0     True
1    False
2     True
dtype: bool

In [56]: pd.Panel(dict(df1=df1[mask],df2=df2[mask])).apply(report_diff, axis=0)
Out[56]:
   A     B     C
0  a  c->d     1
2  1     2  3->4

edited May 23 '17 at 10:29

Community

1
1

answered Mar 02 '17 at 22:47

MaxU - stand with Ukraine

205,989
36
386
419

1

I like it. Much more pandastic – dataflow Mar 02 '17 at 22:58
yes i like report_diff solution , but i'd like my output dataframe without same rows. i edited the question to highlight it. dataframe can be 100MB and can contain any string or number – alexprice Mar 03 '17 at 09:57
@alexprice, please see UPDATE – MaxU - stand with Ukraine Mar 03 '17 at 11:29

score 1 · Answer 2 · answered Mar 02 '17 at 22:54

How about a good ole list comprehension on the flattened contents...

import pandas as pd
import numpy as np

df1=pd.DataFrame({1:['a','a'],2:['c','c']})
df2=pd.DataFrame({1:['a','a'],2:['d','c']})

rows_different_mask = (df1 != df2).any(axis=1)

pairs = zip(df1.values.reshape(1, -1)[0], df2.values.reshape(1, -1)[0])
new_elems = ["%s->%s" %(old, new) if (old != new) else new for old, new in pairs]
df3 = pd.DataFrame(np.reshape(new_elems, df1.values.shape))
print df3

   0     1
0  a  c->d
1  a     c

Show complete rows highliting difference between dataframes df1 , df2 , but only when difference in row's cell exists

2 Answers2