comparing two DataFrames, specific questions

Question

I was reading Andy's answer to the question "Outputting difference in two Pandas dataframes side by side - highlighting the difference".

I have two questions regarding the code. Unfortunately I don't yet have 50 rep to comment on the answer, so I hope I can get some help here.

1. What does In [24]: changed = ne_stacked[ne_stacked] do? I'm not sure what df1 = df[df] does and I can't seem to find an answer in the pandas docs. Could someone explain this to me, please?
2. Is np.where(df1 != df2) the same as df1.where(df1 != df2)? If not, what is the difference?

score 4 · Accepted Answer · answered May 30 '17 at 22:56

Question 1

ne_stacked is a pd.Series that consists of True and False values that indicate where df1 and df2 are not equal.

ne_stacked[boolean_array] is a way to filter the series ne_stacked by eliminating the rows of ne_stacked where boolean_array is False and keeping the rows of ne_stacked where boolean_array is True.

It so happens that ne_stacked is itself a boolean array, so it can be used to filter itself. Why would we want to do this? So that we can see which index values remain after filtering.

So ne_stacked[ne_stacked] is a subset of ne_stacked with only True values.
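As a minimal sketch of the self-filtering above: df1 and df2 below are made-up frames (not the ones from the linked answer), and ne_stacked is built with stack() on the element-wise comparison, which is an assumption about how the linked answer constructs it.

```python
import pandas as pd

# Hypothetical data: df1 and df2 differ in exactly two cells.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [1, 9, 3], "b": [4, 5, 0]})

# Boolean Series indexed by (row, column): True where the frames differ.
ne_stacked = (df1 != df2).stack()

# Filter the Series by itself: only the True entries survive.
changed = ne_stacked[ne_stacked]

# The surviving index lists the (row, column) pairs that changed.
changed_locations = list(changed.index)
print(changed_locations)
```

The values of changed are all True by construction; the useful part is its index.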

Question 2

np.where

np.where does two things. If you pass only a condition, as in np.where(df1 != df2), you get a tuple of arrays: the first holds the row positions and the second holds the matching column positions of every cell where the condition is True. I usually use it like this

i, j = np.where(df1 != df2)

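Now I can get at all elements of df1 or df2 in which there are differences like

df1.values[i, j]

Or I can assign to those cells. Note that this writes into the array returned by .values, which is not guaranteed to change df1 itself: the array may be a copy when the columns have mixed dtypes, or read-only when copy-on-write is enabled.

df1.values[i, j] = -99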

Or lots of other useful things.
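A small sketch of the one-argument form, using made-up frames. To avoid depending on whether .values is a view, it assigns into an explicit copy and rebuilds a frame from it; the names patched and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df1 and df2 differ in exactly two cells.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [1, 9, 3], "b": [4, 5, 0]})

# One-argument form: row and column positions where the condition is True.
i, j = np.where(df1 != df2)

# Pointwise lookup of the differing cells in each frame.
old_values = df1.to_numpy()[i, j]
new_values = df2.to_numpy()[i, j]

# Assign into an explicit copy, then wrap it back into a DataFrame.
patched = df1.to_numpy().copy()
patched[i, j] = -99
result = pd.DataFrame(patched, index=df1.index, columns=df1.columns)
print(result)
```

Working on a copy leaves df1 untouched, which is usually what you want when comparing two frames.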

You can also use np.where as an if, then, else for arrays

np.where(df1 != df2, -99, 99)

This produces an array with the same shape as df1 and df2, with -99 wherever df1 != df2 and 99 everywhere else.
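For illustration, here is the three-argument form on two made-up frames. Note that the result is a plain NumPy array, so the index and column labels are not carried over.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df1 and df2 differ in exactly two cells.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [1, 9, 3], "b": [4, 5, 0]})

# Three-argument form: -99 where the frames differ, 99 where they agree.
flags = np.where(df1 != df2, -99, 99)
print(flags)
```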

df.where

On the other hand, df.where takes a boolean condition as its first argument and returns an object of the same shape as df. Cells where the condition is True are kept; the rest are replaced by np.nan or, if one is given, by the second argument of df.where.

df1.where(df1 != df2)

Or

df1.where(df1 != df2, -99)
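A short sketch of both calls on made-up frames; the names kept and filled are only illustrative.

```python
import pandas as pd

# Hypothetical data: df1 and df2 differ in exactly two cells.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [1, 9, 3], "b": [4, 5, 0]})

# Keep df1's values where the frames differ; everything else becomes NaN.
kept = df1.where(df1 != df2)

# Same, but fill the cells that agree with -99 instead of NaN.
filled = df1.where(df1 != df2, -99)

print(kept)
print(filled)
```

Unlike np.where, both results are DataFrames that keep df1's index and columns.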

Are they the same?
Clearly they are not the "same", but you can use them in similar ways:

np.where(df1 != df2, df1, -99)

should give the same values as

df1.where(df1 != df2, -99).values
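The claim can be checked directly on a small example. The frames below are made up, and the check only covers this simple all-integer case.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df1 and df2 differ in exactly two cells.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [1, 9, 3], "b": [4, 5, 0]})

# NumPy version: take df1's value where the frames differ, -99 elsewhere.
via_numpy = np.where(df1 != df2, df1, -99)

# pandas version, converted to a bare array for comparison.
via_pandas = df1.where(df1 != df2, -99).values

same = np.array_equal(via_numpy, via_pandas)
print(same)
```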
