I used to believe that the `in` operator in Python checks for the presence of an element in a collection using the equality check `==`, so `element in some_list` is roughly equivalent to `any(x == element for x in some_list)`. For example:

True in [1, 2, 3]
# True because True == 1

or

1 in [1., 2., 3.]
# also True because 1 == 1.

However, it is well known that NaN is not equal to itself, so I expected `float("NaN") in [float("NaN")]` to be `False`. And indeed it is.

However, if we use `numpy.nan` instead of `float("NaN")`, the situation is quite different:

import numpy as np
np.nan in [np.nan, 1, 2]
# True

But `np.nan == np.nan` still gives `False`!

How is this possible? What's the difference between `np.nan` and `float("NaN")`? How does `in` deal with `np.nan`?

Alex Riley
Ilya V. Schurov

2 Answers

To check whether an item is in a list, Python tests for object identity first, and tests for equality only if the objects are different. [1]

`float("NaN") in [float("NaN")]` is `False` because two different NaN objects are involved in the comparison. The test for identity therefore returns `False`, and then the test for equality also returns `False`, since `NaN != NaN`.

`np.nan in [np.nan, 1, 2]`, however, is `True` because the same NaN object is involved in the comparison. The test for object identity returns `True`, so Python immediately recognises the item as being in the list.

The `__contains__` method (invoked using `in`) for many of Python's other built-in container types, such as tuples and sets, is implemented using the same check.
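
As a rough illustration of the check described above, here is a minimal pure-Python sketch. The helper name `contains` is made up for this example; the real logic lives in C inside CPython, so treat this as an approximation of the behaviour rather than the actual implementation.

```python
def contains(container, item):
    """Approximate `item in container` for built-in containers (hypothetical helper)."""
    # Identity is tried first; equality is only consulted if identity fails.
    return any(x is item or x == item for x in container)


nan = float("nan")

# The list holds the very same object as `nan`, so the identity test succeeds.
same_object = contains([nan, 1, 2], nan)

# The list holds a different NaN object: identity fails, and NaN != NaN.
different_objects = contains([float("nan")], nan)

print(same_object, different_objects)
```

The `x is item or x == item` ordering is what lets a NaN object be "found" in a list that holds that exact object, even though it never compares equal to itself.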


[1] At least this is true in CPython. Object identity here means that the objects are found at the same memory address: the `__contains__` method for lists is implemented using `PyObject_RichCompareBool`, which quickly compares object pointers before attempting a potentially more complicated object comparison. Other Python implementations may differ.

Alex Riley
  • Yupp. `nan = float("NaN"); nan in [nan]` gives `True`. Thanks! – Ilya V. Schurov Dec 08 '17 at 20:36
  • Is there any benefit to doing this (first identity, then equality)? Why not check equality directly? I am asking because I always thought `nan` is the only object for which `x is x` holds true but `x != x`. Seeing this, I am wondering if there are others? – ayhan Dec 08 '17 at 20:42
  • @ayhan - Checking identity is a relatively cheap operation (just compare memory addresses). Checking equality may be arbitrarily expensive. – John Y Dec 08 '17 at 20:49
  • @John A professor of mine used to say: if it doesn't have to be correct, I can make it arbitrarily fast. [The documentation](https://docs.python.org/3/library/stdtypes.html#common-sequence-operations) says for `x in s`: `True if an item of s is equal to x, else False`. Seems like a bug - whether documentation or implementation is up for debate. Considering the consequences, it should probably just be documented what the `in` operator actually does. – Voo Dec 09 '17 at 00:35
  • @Voo since `nan` is the only thing that breaks this it is apparently not considered important enough. See the [rejection notice](https://www.python.org/dev/peps/pep-0754/#rejection-notice) to PEP 754 _This PEP has been rejected. After sitting open for four years, it has failed to generate sufficient community interest._ – Paul Panzer Dec 09 '17 at 09:54
  • @Paul It would also happen for any custom class that defines a weird equality operator I'd think, but I agree that it's a problem that probably only comes up with nan in practice. – Voo Dec 09 '17 at 11:35

One thing worth mentioning is that numpy arrays do behave as expected:

a = np.array((np.nan,))
a[0] in a
# False

Variations of the theme:

[np.nan]==[np.nan]
# True
[float('nan')]==[float('nan')]
# False
{np.nan: 0}[np.nan]
# 0
{float('nan'): 0}[float('nan')]
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# KeyError: nan
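
The same variations can be reproduced without numpy, which suggests the effect comes from reusing one NaN object rather than from anything numpy-specific (`np.nan` is an ordinary Python `float` stored once as a module attribute). A small sketch, where the variable `nan` stands in for `np.nan`:

```python
nan = float("nan")  # one NaN object, reused below (plays the role of np.nan)

# List equality compares items by identity first, then by ==.
list_eq_shared = [nan] == [nan]
list_eq_fresh = [float("nan")] == [float("nan")]

# Dict lookup and tuple/set membership use the same identity shortcut.
dict_lookup = {nan: 0}[nan]
in_tuple = nan in (nan,)
in_set = nan in {nan}

print(list_eq_shared, list_eq_fresh, dict_lookup, in_tuple, in_set)
```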

Everything else is covered in @AlexRiley's excellent answer.

Paul Panzer
  • Interesting that it works as expected even for `dtype=object`. – Ilya V. Schurov Dec 08 '17 at 21:02
  • `in` for NumPy arrays is [implemented](https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/sequence.c#L28-L46) as `(array == item).any()` as your answer shows. I guess the developers were forced to choose this approach since NumPy arrays are not internally collections of references to objects and so ids cannot be compared. – Alex Riley Dec 08 '17 at 21:24
  • Yep, numpy appears to always compare by value: `a = np.array([None,[np.nan]]); a[1] in a` is also `False`. – Paul Panzer Dec 08 '17 at 21:27
  • @AlexRiley actually, in the example I just gave we have `a[1] == [np.nan]` -> `True` but `a == [np.nan]` -> `array([False, False], dtype=bool)` which I do find puzzling. Can you explain that? – Paul Panzer Dec 08 '17 at 21:41
  • I think it's because with `a[1] == [np.nan]` we are comparing two lists that contain the same NaN object, and the list equality check uses `PyObject_RichCompareBool` to compare items (so ids are checked). For `a == [np.nan]`, it is the NumPy array's equality method that is called, and this method does not (or cannot) check ids of values, so `a[1]` and `[np.nan]` are not seen as equal. (I believe the code for object arrays that is executed is [here](https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/scalartypes.c.src#L1174-L1208)). – Alex Riley Dec 08 '17 at 21:59
  • Wow. `[np.nan]==[np.nan]` is quite surprising as well. – Ilya V. Schurov Dec 08 '17 at 22:06
  • Ilya, @AlexRiley actually, I think I know what's going on and I think it's a genuine bug not related to `nan`. The problem is with the implementation of `__contains__` you mention above. If `item` happens to be a list it is broadcast. As a consequence `[5] in np.array([None, 4, [5]])` -> `False` and `[5] in np.array([None, 5, [4]])` -> `True`! – Paul Panzer Dec 08 '17 at 22:25
  • @PaulPanzer, Hmm… I still don't understand why `[5] in np.array([None, 4, [5]])` is `False` and `[5] in np.array([None, 5, [4]])` is `True`. (Probably, we should open another question.) – Ilya V. Schurov Dec 08 '17 at 22:52
  • As I said, I think it's a bug. When `arr == item` is evaluated (`item` being `[5]`), `[5]` is broadcast to something like (conceptually) `[5, 5, 5]`, so the list-ness is lost in translation. I've opened an issue at the numpy tracker. – Paul Panzer Dec 08 '17 at 23:03