Let's say I am updating my DataFrame (df) with another DataFrame (df2):

import pandas as pd
import numpy as np

df=pd.DataFrame({'axis1': ['Unix','Window','Apple','Linux'],
                 'A': [1,np.nan,1,1],
                 'B': [1,np.nan,np.nan,1],
                 'C': [np.nan,1,np.nan,1],
                 'D': [1,np.nan,1,np.nan],
                 }).set_index(['axis1'])

print(df)

df2=pd.DataFrame({'axis1': ['Unix','Window','Apple','Linux','A'],
                 'A': [1,1,np.nan,np.nan,np.nan],
                 'E': [1,np.nan,1,1,1],
                 }).set_index(['axis1'])

df = df.reindex(columns=df2.columns.union(df.columns),
                index=df2.index.union(df.index))

df.update(df2)

print(df)

Is there a command to get the number of cells that were updated (changed from NaN to 1)? I want to use this to track changes to my DataFrame.

ccsv

1 Answer

I can't think of a built-in pandas method for this. You would have to save a copy of the original df before the update and then compare the two. The trick is to treat cells that are NaN in both frames as unchanged, because NaN never compares equal to itself. Here df3 is a copy of df taken before the call to update:

In [104]:

df.update(df2)
df
Out[104]:
         A   B   C   D   E
axis1                     
A      NaN NaN NaN NaN   1
Apple    1 NaN NaN   1   1
Linux    1   1   1 NaN   1
Unix     1   1 NaN   1   1
Window   1 NaN   1 NaN NaN

[5 rows x 5 columns]
In [105]:

df3
Out[105]:
         A   B   C   D   E
axis1                     
A      NaN NaN NaN NaN NaN
Apple    1 NaN NaN   1 NaN
Linux    1   1   1 NaN NaN
Unix     1   1 NaN   1 NaN
Window NaN NaN   1 NaN NaN

[5 rows x 5 columns]
In [106]:

# naive comparison: NaN != NaN evaluates to True, so unchanged NaN cells are flagged as well
df!=df3
Out[106]:
            A      B      C      D     E
axis1                                   
A        True   True   True   True  True
Apple   False   True   True  False  True
Linux   False  False  False   True  True
Unix    False  False   True  False  True
Window   True   True  False   True  True

[5 rows x 5 columns]

In [107]:
# use np.count_nonzero for easy counting; note this gives the wrong result because of the NaN cells
np.count_nonzero(df!=df3)
Out[107]:
16
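
The count of 16 is too high because NaN never compares equal to itself, so every cell that was NaN both before and after the update is flagged as changed. A minimal sketch of the effect on two small Series (the names a, b, naive and nan_aware are illustrative and not part of the question):

```python
import numpy as np
import pandas as pd

# Two aligned Series: position 1 is NaN in both, position 2 really changes.
a = pd.Series([1.0, np.nan, np.nan])
b = pd.Series([1.0, np.nan, 2.0])

# Naive inequality flags the NaN/NaN pair as different.
naive = a != b

# Masking out positions where both sides are NaN leaves only real changes.
nan_aware = naive & ~(a.isna() & b.isna())

print(naive.tolist())
print(nan_aware.tolist())
```
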

In [132]:

# a cell is unchanged if the values are equal or both are NaN; negate to get the changed cells
~((df == df3) | (np.isnan(df) & np.isnan(df3)))
Out[132]:
            A      B      C      D      E
axis1                                    
A       False  False  False  False   True
Apple   False  False  False  False   True
Linux   False  False  False  False   True
Unix    False  False  False  False   True
Window   True  False  False  False  False

[5 rows x 5 columns]
In [133]:

np.count_nonzero(~((df == df3) | (np.isnan(df) & np.isnan(df3))))
Out[133]:
5
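
For logging purposes, the same comparison could be wrapped in a small helper. This is only a sketch under the assumption that both frames carry identical row and column labels, which holds here because update modifies df in place. The name count_updated_cells is hypothetical, not a pandas function, and isna is used in place of np.isnan so that non-numeric columns would not raise.

```python
import numpy as np
import pandas as pd


def count_updated_cells(before, after):
    """Count cells that differ between two identically labelled frames.

    Hypothetical helper: a cell that is NaN in both frames counts as unchanged.
    """
    both_nan = before.isna() & after.isna()
    changed = ~((before == after) | both_nan)
    return int(changed.to_numpy().sum())


# Data from the question.
df = pd.DataFrame({'axis1': ['Unix', 'Window', 'Apple', 'Linux'],
                   'A': [1, np.nan, 1, 1],
                   'B': [1, np.nan, np.nan, 1],
                   'C': [np.nan, 1, np.nan, 1],
                   'D': [1, np.nan, 1, np.nan]}).set_index('axis1')
df2 = pd.DataFrame({'axis1': ['Unix', 'Window', 'Apple', 'Linux', 'A'],
                    'A': [1, 1, np.nan, np.nan, np.nan],
                    'E': [1, np.nan, 1, 1, 1]}).set_index('axis1')

df = df.reindex(columns=df2.columns.union(df.columns),
                index=df2.index.union(df.index))

before = df.copy()   # snapshot taken before the in-place update
df.update(df2)

n_changed = count_updated_cells(before, df)
print(n_changed)
```

The snapshot costs one full copy of the frame per update, which is the trade-off of this approach.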
EdChum
  • I submitted this as an enhancement request for pandas: https://github.com/pydata/pandas/issues/6891. Not sure why this isn't already built in. – ccsv Apr 16 '14 at 09:59
  • @ccsv I don't know. I've never needed this myself, but I can see why it would be useful. It would be problematic, though, and there would be a performance penalty if it were implemented for every method that updates a DataFrame/Series/Panel. I think my method generates a copy, according to this: http://stackoverflow.com/questions/10819715/comparing-numpy-arrays-so-that-nans-compare-equal and comparing `NaN` is problematic. – EdChum Apr 16 '14 at 10:01
  • I use it to log changes. I believe performance problems can probably be solved using `cython` – ccsv Apr 16 '14 at 10:47