
I ran into unexpected behavior when combining NumPy, Python's set, and NaN (not-a-number):

>>> set([np.float64('nan'), np.float64('nan')])
set([nan, nan])
>>> set([np.float32('nan'), np.float32('nan')])
set([nan, nan])
>>> set([np.float('nan'), np.float('nan')])
set([nan, nan])
>>> set([np.nan, np.nan])
set([nan])
>>> set([float('nan'), float('nan')])
set([nan, nan])

Here np.nan yields a single-element set, while NaNs built with NumPy's float types each stay as separate elements. So does float('nan')! And note that:

>>> type(float('nan')) == type(np.nan)
True

I wonder how this difference comes about and what the rationale is behind the different behaviors.

Finn Årup Nielsen
  • It looks like `numpy.nan` is a singleton, so every reference to it has the same identity. – Ashwini Chaudhary Apr 09 '15 at 17:05
  • Look at `id(np.nan)` vs. `id(np.float64('nan'))` (for repeated instances). – hpaulj Apr 09 '15 at 17:11
  • `[id(np.float64('nan')) for n in range(10)]` gives `[65159576, 65159576, 65159576, 65159576, 65159576, 65159576, 65159576, 65159576, 65159576, 65159576]` and `[id(np.nan) for n in range(10)]` gives `[35133032, 35133032, 35133032, 35133032, 35133032, 35133032, 35133032, 35133032, 35133032, 35133032]` – Finn Årup Nielsen Apr 09 '15 at 17:19
  • @FinnÅrupNielsen The NaN object you create in that case is destroyed each time, since it is a temporary with no references, and its memory location is then reused. That's why you get the same id each time. – Mark Ransom Apr 09 '15 at 17:24
  • @FinnÅrupNielsen I get a different result with numpy 1.8.0: two different ids for `[id(np.float64('nan')) for n in range(10)]`. That's because the objects are thrown away, so CPython can reuse the memory. Try keeping them alive: `x = [np.float64('nan') for n in range(10)]; [id(y) for y in x]` – Ashwini Chaudhary Apr 09 '15 at 17:25
  • A very related question is this one: http://stackoverflow.com/questions/26245862 – Finn Årup Nielsen Apr 09 '15 at 18:21
  • Remember that `np.float is float`, so your 3rd and 5th tests are the same – Eric Jul 04 '17 at 11:57

1 Answer

One of the properties of NaN is that NaN != NaN, unlike all other numbers. However, when set finds an existing member in the hash slot it is probing, it first checks whether the new object is that same object (same id) and only falls back to == if it is not. If you have two objects with different ids that both have the value NaN, the identity check fails and so does ==, so you get two entries in the set. If both are the same object, they collapse into a single entry.

As pointed out by others, np.nan is a single object that will always have the same id.
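To make the identity-before-equality rule concrete, here is a minimal sketch. It assumes CPython and any reasonably recent NumPy; the variable names are only for illustration and are not from the original post.

```python
import numpy as np

a = float('nan')
b = float('nan')

# a and b are distinct objects, and NaN never compares equal to anything,
# so the set keeps both of them.
two_distinct = len({a, b})

# The same object twice: the identity check succeeds, so == is never consulted.
same_object_twice = len({a, a})

# np.nan is one module-level float object, so this is the same situation.
numpy_singleton = len({np.nan, np.nan})

# Each np.float64('nan') call builds a new scalar object.
numpy_scalars = len({np.float64('nan'), np.float64('nan')})

print(two_distinct, same_object_twice, numpy_singleton, numpy_scalars)

# List membership follows the same identity-first rule.
print(a in [a], b in [a])
```

The `in` check on the last line shows that this is not specific to sets: containers in general test `x is y` before `x == y`.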

Mark Ransom
  • Regarding the "always have the same id", some things are odd: if `l = np.array([np.nan, np.nan])`, then `id(l[0]), id(l[1]), [id(x) for x in l]` is `(946651263888, 946651263888, [946651263888, 946651263912])`. – fuglede Oct 08 '18 at 09:32
  • @fuglede This is a great comment! Does anyone have an idea why this is the case? – episodeyang Jul 12 '19 at 03:11
  • @episodeyang that's a good question. I would expect `for x` to be consistent, either giving a reference to the actual object or making a copy. It appears there's some subtlety to it that I haven't figured out yet. – Mark Ransom Jul 12 '19 at 17:18
  • Also note that there's no guarantee that subsequent calls to `id(l[0])` give the same result, and that if you had used `list(map(id, l))` instead of the comprehension, you would get duplicated ids (but `list(map(id, list(l)))` does create a copy); arrays are fun. This behavior is not specific to `np.nan`; the same would work with, say, an integer. – fuglede Jul 13 '19 at 10:00
  • The integer case is also interesting in itself in that the objects you get out when using low integers need not be the singletons maintained by CPython; compare `[id(x) for x in np.array([2, 2])]` and `[id(x) for x in [2, 2]]`. So all of this really says more about array behavior than anything, but it does say that you can't always assume that two things that both look like `np.nan` will necessarily have the same `id`. In particular, in the example provided in the original post, `set(np.array([np.nan, np.nan]))` would have given a two-element set. – fuglede Jul 13 '19 at 10:13
  • @fuglede Any suggestions to work around this and count the number of distinct elements in an array in a one-liner? That is, counting np.nan as a single entity in all its appearances. – drevicko Feb 20 '23 at 07:12
  • @drevicko: if you just want something short, `np.unique` does that out of the box and by default (note the `equal_nan` parameter), so for an input `a`, you are just looking for `len(np.unique(a))`. – fuglede Feb 20 '23 at 10:36