I am using the pandas shift function to check whether a value in a Series is equal to the previous value. Basically:

import pandas as pd

a = pd.Series([2, 2, 4, 5])

a == a.shift()
Out[1]: 
0    False
1     True
2    False
3    False
dtype: bool

This is as expected. (The first comparison is False because we are comparing with the NA of the shifted series.) Now, I also have Series with missing values, i.e. None, like this:

b = pd.Series([None, None, 4, 5])

Here the comparison of the two Nones gives False:

b == b.shift()
Out[3]: 
0    False
1    False
2    False
3    False
dtype: bool

I'd be willing to accept some sort of philosophical reasoning arguing that comparing None is meaningless etc., however

c = None
d = None
c == d
Out[4]: True

What is going on here?!

And what I really want to know is: how can I perform the comparison on my b Series so that Nones are treated as equal? That is, I want b == b.shift() to give the same result as a == a.shift() gave.

mortysporty

3 Answers

4

The None values get cast to NaN, and NaN has the property that it is not equal to itself:

In[54]:
b = pd.Series([None, None, 4, 5])
b

Out[54]: 
0    NaN
1    NaN
2    4.0
3    5.0
dtype: float64

As you can see here:

In[55]:
b==b

Out[55]: 
0    False
1    False
2     True
3     True
dtype: bool
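To connect this back to the `c == d` example in the question, here is a minimal sketch contrasting plain Python `None` with the NaN that pandas stores in its place. The variable names `nan` and `nan_equals_itself` are introduced only for illustration.

```python
import math
import pandas as pd

# Plain Python: None is a singleton, so it compares equal to itself.
print(None == None)

# NaN is a float value that, per IEEE 754, never compares equal to itself.
nan = float("nan")
print(nan == nan)

# In a numeric Series the Nones are stored as NaN, so the NaN rule applies.
b = pd.Series([None, None, 4, 5])
print(b.dtype)
print(math.isnan(b.iloc[0]))

# Element-wise self-comparison is False exactly where the value is NaN.
nan_equals_itself = (b == b).tolist()
print(nan_equals_itself)
```

So the Series comparison never sees two Nones at all; it sees two NaNs.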

I'm not sure of the cleanest way to get this to work, but this expression does:

In[68]:
( (b == b.shift())  | ( (b != b.shift()) &  (b != b) ) )

Out[68]: 
0     True
1     True
2    False
3    False
dtype: bool

The result for the first row is spurious, because shifting down means comparing against a non-existent row:

In[69]:
b.shift()

Out[69]: 
0    NaN
1    NaN
2    NaN
3    4.0
dtype: float64

So the first row compares as True: it is NaN in b and, after the shift, NaN in the shifted series too, which satisfies the second branch of the boolean expression.

To work around the false positive in the first row, you could slice the result to ignore that row:

In[70]:
( (b == b.shift())  | ( (b != b.shift()) &  (b != b) ) )[1:]

Out[70]: 
1     True
2    False
3    False
dtype: bool

As to why it gets cast: pandas tries to coerce the data to a compatible NumPy dtype. Here float is selected because the data mixes ints and None values, and neither None nor NaN can be represented in an integer dtype.
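The coercion can be checked directly by inspecting the dtypes. This is a small sketch; `b_obj` is a hypothetical name, and passing `dtype=object` is shown only to illustrate that the conversion to NaN comes from the numeric dtype, not as a recommendation.

```python
import pandas as pd

a = pd.Series([2, 2, 4, 5])
b = pd.Series([None, None, 4, 5])
# Forcing object dtype keeps the original Python objects, including None.
b_obj = pd.Series([None, None, 4, 5], dtype=object)

print(a.dtype)      # an integer dtype: no missing values to represent
print(b.dtype)      # float64: None had to become NaN
print(b_obj.dtype)  # object
print(b_obj.iloc[0] is None)
```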

To get the same result as for a in your example, you should overwrite the first row with False, as that comparison should always fail:

In[78]:
result = pd.Series( ( (b == b.shift())  | ( (b != b.shift()) &  (b != b) ) ) )
result.iloc[0] = False
result

Out[78]: 
0    False
1     True
2    False
3    False
dtype: bool
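One caveat, as far as I can tell: the expression above is True whenever the current value is NaN, whatever the previous value was, so a NaN following a real number would also be flagged as "equal". A possible variant, sketched here with a hypothetical helper named `equals_previous`, requires both the value and its predecessor to be missing, using `Series.isna()`:

```python
import pandas as pd

def equals_previous(s):
    """Compare each value with the previous one, treating two missing values as equal."""
    prev = s.shift()
    # True only where both the current and the previous value are missing.
    both_missing = s.isna() & prev.isna()
    result = (s == prev) | both_missing
    if len(result) > 0:
        # The first row has no predecessor, so it can never match.
        result.iloc[0] = False
    return result

a = pd.Series([2, 2, 4, 5])
b = pd.Series([None, None, 4, 5])
c = pd.Series([4, None])  # a NaN that follows a real value

print(equals_previous(a).tolist())
print(equals_previous(b).tolist())
print(equals_previous(c).tolist())
```

For a and b this gives the same result as `a == a.shift()` in the question; for c it reports no match in either row.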
EdChum
  • 1
    For parties interested to know how this works, refer to https://stackoverflow.com/questions/44864912/how-are-inf-and-nan-implemented – cs95 Aug 17 '17 at 14:23
  • 1
    Another good SO about NaN and boolean logic. https://stackoverflow.com/questions/43925797/why-python-pandas-does-not-use-3-valued-logic/43925913#43925913 – Scott Boston Aug 17 '17 at 14:28
  • Aha! :) Thanks. Any thoughts on the last part of the question? I really don't want to introduce a dummy like 999 or similar – mortysporty Aug 17 '17 at 14:35
  • You may want to slice the series so you ignore the first row perhaps, I'll update – EdChum Aug 17 '17 at 14:36
  • I'll give it a go a little later (currently commuting :)) and mark your reply as the answer if it works! – mortysporty Aug 17 '17 at 14:42
  • It'd make sense to set the first row to `False` as it should always fail in the comparison, I'll show how to do this – EdChum Aug 17 '17 at 14:42
1

If it is okay for you to compare neighboring entries periodically (i.e., the last entry is compared to the first one), there is another simple solution using NumPy's roll function:

import numpy as np

b = [None, None, 4, 5]
# keep b as a plain list (or an object-dtype Series): a default float64 Series
# stores NaN instead of None, and NaN != NaN

np.roll(b,1) == b

Returns:

> array([False,  True, False, False])
David
0

As indicated here, None is treated like NaN, which does not compare equal to itself in pandas/NumPy.

But for None, there is an easy workaround using apply:

In[1]:
foo = pd.Series([None, 'a'])
foo==None

Out[1]:
0    False
1    False
dtype: bool 


In[2]:
foo.apply(lambda a:a==None)
Out[2]: 
0     True
1    False
dtype: bool
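A vectorized alternative to the `apply` call is `Series.isna()`, which flags both None and NaN. The sketch below passes `dtype=object` explicitly, which is my assumption to guarantee that None is stored as-is; `mask_apply` and `mask_isna` are illustrative names.

```python
import pandas as pd

# Explicit object dtype so the None is kept as a Python object.
foo = pd.Series([None, 'a'], dtype=object)

# Element-wise identity check, as in the apply-based workaround.
mask_apply = foo.apply(lambda x: x is None)

# Built-in missing-value check; also True for NaN, unlike `is None`.
mask_isna = foo.isna()

print(mask_apply.tolist())
print(mask_isna.tolist())
```

The two masks agree here, but they would differ on a float Series, where the missing values are NaN rather than None.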