I am trying to see if any values change in my dataframe, from row to row. One column in particular has a lot of None
values.
If I check for inequality between None
values, element-wise, it gives the answer I expect (that is, None != None
returns False
), but not when I do that same comparison on the column.
Here's a minimal working example:
>>> example_data = [None]*3 + ['a','b','b'] + [None]*4
>>> df = pd.DataFrame(example_data, columns = ['col1'])
>>> print df
col1
0 None
1 None
2 None
3 a
4 b
5 b
6 None
7 None
8 None
9 None
To see if the value changed from one line to the next, I'm shifting everything and comparing the values:
>>> did_it_change = (df != df.shift(1))
>>> print did_it_change
col1
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 True
So long as None
is not involved in the comparison, it's accurate (e.g. row 5 changed the value to b
from its previous value of b
, so it did not change).
If I check a single element, it works as expected:
>>> df.loc[1,'col1'] != df.loc[2,'col1']
False
But that's not the answer that appears in the full column: did_it_change.loc[1,:]
is True
.
I do get the expected outcome if I use ""
instead of None
, but replacing None
with empty string like in this question seems arbitrary and superfluous.
This is the outcome I'd expect.
>>> example_data = [""]*3 + ['a','b','b'] + [""]*4
>>> df = pd.DataFrame(example_data, columns = ['col1'])
>>> did_it_change = (df != df.shift(1))
>>> print did_it_change
col1
0 True
1 False
2 False
3 True
4 True
5 False
6 True
7 False
8 False
9 False
What's going on here? Why does the comparison with None
give different answers if I'm calling it on a dataframe column instead of its elements?