Python NaN's in set and uniqueness

Question

I just stumbled across this interesting behavior of Python involving NaN's in sets:

# Test 1
nan = float('nan')
things = [0, 1, 2, nan, 'a', 1, nan, 'a', 2, nan, nan]
unique = set(things)
print(unique)  # {0, 1, 2, nan, 'a'}

# Test 2
things = [0, 1, 2, float('nan'), 'a', 1, float('nan'), 'a', 2, float('nan'), float('nan')]
unique = set(things)
print(unique)  # {0, 1, 2, nan, nan, nan, nan, 'a'}

That the same key nan shows up multiple times within the last set of course seems strange.

I believe this is caused by nan not being equal to itself (as defined by IEEE 754), together with the fact that objects are compared based on memory location (id()) prior to equality of values, when adding objects to a set. It then appears that each float('nan') results in a fresh object, rather than returning some global "singleton" float (as is done for e.g. the small integers).
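
The mechanism described above can be checked directly. This is a minimal sketch, assuming CPython: the identity-before-equality shortcut in container lookups is documented behavior, while float('nan') producing a fresh object on each call is an implementation detail.

```python
nan = float('nan')

# IEEE 754: NaN compares unequal to everything, including itself
print(nan == nan)        # False

# Containers test identity first, then equality, so the same object is found
print(nan in [nan])      # True
print(nan in {nan})      # True

# A second NaN object is neither identical nor equal to the first
other = float('nan')
print(other is nan)      # False
print(other in {nan})    # False

# Hence one object collapses to one entry, while distinct objects do not
print(len({nan, nan}))                    # 1
print(len({float('nan'), float('nan')}))  # 2
```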

In fact I just found this SO question describing the same behavior, seemingly confirming the above.

Questions:

Is this really desired behavior?
Say I was given the second things from above. How would I go about counting the number of actually unique elements? The usual len(set(things)) obviously does not work. I can in fact use import numpy as np; len(np.unique(things)), but I would like to know if this can be done without using third-party libraries.

Addendum

As a small addendum, let me add that a similar story holds for dicts:

d = {float('nan'): 0, float('nan'): 1}
print(d)  # {nan: 0, nan: 1}

I was under the impression that NaN's were a total no-go as keys in dicts, but it does actually work out as long as you store references to the exact objects used as keys:

nan0 = float('nan')
nan1 = float('nan')
d = {nan0: 0, nan1: 1}
d[float('nan')]  # KeyError
d[nan0]  # 0
d[nan1]  # 1

Surely hacky, but I can see this trick being useful if one is in need of storing additional values in an existing dict, and one does not care about which keys to use, except of course that each new key has to not be in the dict already. That is, one can use float('nan') as a factory for generating an unending supply of new dict keys, guaranteed to never collide with each other, existing or future keys.

There is often no one-size-fits-all solution for NaN issues. Suppose you have some calculation that would have produced four different results if performed with true real-number arithmetic but, because of limitations of floating-point arithmetic, it produced 1, NaN, 3, NaN. In this case, the ideal answer to how many elements are in the set of the results would be four. So the NaNs should be counted as different things. On the other hand, in some other calculation, the ideal results might have been 7, 7, 7, 7, but the computed results were 7, NaN, 7, NaN. In this case,… — Eric Postpischil, Nov 17 '20 at 12:35
… the ideal answer to how many elements are in the set of the results would be one. So the NaNs should be counted as the same. But the nature of a NaN is that it is not known what it would have been in an ideal situation. So we cannot know how to process it, in the absence of other information. Therefore, if you have NaNs in your data, it is largely incumbent upon you to deal with them, not to expect that library routines will handle them for you. — Eric Postpischil, Nov 17 '20 at 12:36
As for your first question: if you define `class foo(): pass` and `a = foo()`, then printing `b = [a, a]` shows two objects with the same address, while `c = [foo(), foo()]` shows two different objects. I think it has something to do with float() also being a class? You can also try `set(b)` and `set(c)`; the result is the same as in your case. — bmjeon5957, Nov 17 '20 at 12:37
@EricPostpischil Fair, but say that I'm in a situation where I really want to treat NaN's as in the question. Not all NaN's are the result of bad arithmetic, some are deliberately inserted as placeholders. — jmd_dk, Nov 17 '20 at 12:40
@jmd_dk Use None as a placeholder rather than NaN. Also, it would be nice if you could give examples where NaN is required as a placeholder. — Aaj Kaal, Nov 17 '20 at 13:00
@AajKaal NaN is often used to pad data (text or binary) files containing columns of `float`s, or more generally to fill in missing values. Here *some* `float` is required, ruling out `None` (which I agree is otherwise the way to go). — jmd_dk, Nov 17 '20 at 13:05
Re `dict`: different objects (even as keys) are different things. Don't be misled by their identical string representation. — tuergeist, Nov 17 '20 at 13:55

tuergeist · Answer 1 · 2020-11-17T13:00:44.613

The desired behavior of float() is to return an instance of the float class. And you're right, NaN is not equal to itself. Thus float(1) == float(1), whereas float('nan') != float('nan').

To get a unique set, I'd recommend establishing a nan constant as you did in Test 1. If that doesn't fit your case, you can detect NaNs with math.isnan() and filter them out by iterating over the list (or set): newlist = [x for x in things if not (isinstance(x, float) and math.isnan(x))]. The isinstance check is needed because math.isnan('a') raises a TypeError.

You might think: now I have removed all NaNs. What if there was one in the data to begin with?

import math

things = [0, 1, 2, float('nan'), 'a', 1, float('nan'), 'a', 2, float('nan'), float('nan')]
nan = float('nan')
# Drop every float NaN; non-float items such as 'a' are kept as they are
newlist = [x for x in things if not (isinstance(x, float) and math.isnan(x))]
if len(newlist) != len(things):
    # At least one NaN was present, so add back a single one (or handle it however you like)
    newlist.append(nan)
unique = set(newlist)
print(unique)

{0, 1, 2, nan, 'a'}  (element order may vary)
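
The same idea can be wrapped in a small helper that leaves non-float items such as 'a' in place and collapses all float NaNs into one. This is only a sketch using the standard library: the names is_nan and unique_with_single_nan are made up for illustration, and only float NaNs are handled (not, say, Decimal or complex ones).

```python
import math

def is_nan(x):
    """True only for float NaN; non-float items such as 'a' are never NaN."""
    return isinstance(x, float) and math.isnan(x)

def unique_with_single_nan(items):
    """Return a set of items in which all float NaNs collapse into one."""
    result = set()
    seen_nan = False
    for x in items:
        if is_nan(x):
            if not seen_nan:
                result.add(x)   # keep only the first NaN object
                seen_nan = True
        else:
            result.add(x)
    return result

things = [0, 1, 2, float('nan'), 'a', 1, float('nan'), 'a', 2, float('nan'), float('nan')]
unique = unique_with_single_nan(things)
print(len(unique))  # 5
```

Whether collapsing the NaNs is the right policy depends on what they stand for in the data; if each NaN should count as its own element, the plain len(set(things)) already does that.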

`float(np.nan) == float(np.nan)` gives False, just like `float('nan') == float('nan')`, yet `things = [0, 1, 2, float(np.nan), 'a', 1, float(np.nan), 'a', 2, float(np.nan), float(np.nan)]` gives {0, 1, 2, nan, 'a'} as in Test 1. Why is that? — Ruthger Righart, Nov 17 '20 at 13:15
@RuthgerRighart `float(np.nan)` always returns the same object, whereas `float('nan')` always returns a new one. Run `float('nan').__repr__` twice, then retry with `np.nan`. — tuergeist, Nov 17 '20 at 13:39
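
The identity point in the last comment can be checked without NumPy, since np.nan is an ordinary Python float. This is a small sketch, assuming CPython, where calling float() on an exact float returns the argument itself (an implementation detail, not a language guarantee).

```python
x = float('nan')          # stands in for np.nan, which is a plain float object

# float() applied to an existing float hands back the very same object
print(float(x) is x)      # True on CPython

# Parsing the string 'nan' builds a new object on every call
a, b = float('nan'), float('nan')
print(a is b)             # False

# Same object twice gives one set entry; two distinct objects give two
print(len({float(x), float(x)}))          # 1
print(len({float('nan'), float('nan')}))  # 2
```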
