Python sets: Strange behaviour with NaNs

Question

Can anyone explain the following strange behaviour in Python?

>>> set([np.nan, np.nan, np.nan])

{nan}

as expected, but:

>>> set(pd.Series([np.nan, np.nan, np.nan]))

{nan, nan, nan}

They're all just floats:

>>> [type(a) for a in set(pd.Series([np.nan, np.nan, np.nan]))]

[float, float, float]

How can this set have three objects that are the same?

Versions:

NaNs essentially break the contract for hashing: they are never equal to themselves, but they hash to the same value. Note: `float('nan') == float('nan')` will always be false. This leads to weird behavior in hash maps (`dict` objects) and hash sets (`set` objects). — juanpa.arrivillaga, May 29 '19 at 23:39
The first case likely has to do with there being a quick object identity test involved in sets: check whether the identity is the same, since that's cheap; if identical, assume equality; otherwise check equality, which is potentially expensive. When you just use `np.nan` three times in a list, that is three references to the same object. When you put those into a series, it creates a `numpy` array of `float64` dtype, and when you iterate over it again, it produces 3 distinct `nan` `float` objects. — juanpa.arrivillaga, May 29 '19 at 23:43
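The identity shortcut described in this comment can be reproduced without NumPy or pandas. The following is a minimal sketch in plain CPython, using `float("nan")` as a stand-in for `np.nan`; the variable names are illustrative only and do not come from the question.

```python
# One NaN object, referenced three times (analogous to np.nan in a list).
nan = float("nan")

# NaN never compares equal, not even to itself ...
print(nan == nan)  # False

# ... but set membership checks identity before equality, so three
# references to the same object collapse into a single element.
same_object = set([nan, nan, nan])
print(len(same_object))  # 1

# Three separately created NaN objects share neither identity nor
# equality, so the set keeps all of them.
distinct_objects = set([float("nan"), float("nan"), float("nan")])
print(len(distinct_objects))  # 3
```

Iterating over a `float64` series appears to fall into the second case, since each element is boxed into a new float object on the way out.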
@juanpa.arrivillaga `float('nan')` produces the 2nd case. Pretty weird. — gmds, May 29 '19 at 23:45
Similar question on `nan` and `set` from 4 days ago: https://stackoverflow.com/questions/56290015/inconsistent-behavior-of-nans-in-python-numpy-pandas — hpaulj, May 29 '19 at 23:51
@gmds yes, because that will create a new object every time. — juanpa.arrivillaga, May 29 '19 at 23:53
Thanks! Interesting. For any future interested parties, this can be avoided by using `df.drop_duplicates()`. — Michael Dunne, May 30 '19 at 14:21
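For cases where the values have already left pandas, a similar deduplication can be sketched in plain Python by mapping every NaN onto one canonical object before building the set, so that the identity shortcut applies. Note that `unique_values` and `_CANONICAL_NAN` are hypothetical names made up for this sketch, not part of pandas or NumPy.

```python
import math

# Hypothetical helper (not a pandas API): a single shared NaN object that
# every NaN in the input is replaced with.
_CANONICAL_NAN = float("nan")


def unique_values(values):
    """Return a set in which all NaN floats count as one element."""
    return {
        _CANONICAL_NAN if isinstance(v, float) and math.isnan(v) else v
        for v in values
    }


values = [float("nan"), 1.0, float("nan"), 2.0, float("nan")]
result = unique_values(values)

# Count how many NaN entries survived deduplication.
nan_count = sum(1 for v in result if math.isnan(v))
print(len(result), nan_count)  # 3 1
```

Since `np.float64` subclasses Python's `float`, the `isinstance` check should also cover values taken from a `float64` array, though that is an assumption worth verifying for other dtypes.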

0 Answers