Can anyone explain the following strange behaviour in Python (with NumPy imported as `np` and pandas as `pd`)?

>>> set([np.nan, np.nan, np.nan])

{nan}

as expected, but:

>>> set(pd.Series([np.nan, np.nan, np.nan]))

{nan, nan, nan}

They're all just floats:

>>> [type(a) for a in set(pd.Series([np.nan, np.nan, np.nan]))]

[float, float, float]

How can this set have three objects that are the same?

Versions:

  • Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]
  • NumPy 1.15.1
  • Pandas 0.23.4
eyllanesc
  • `nan`s essentially break the contract for hashing: they are never equal to themselves, but they hash to the same value. Note: `float('nan') == float('nan')` will always be false. This leads to weird behavior in hash maps (`dict` objects) and hash sets (`set` objects). – juanpa.arrivillaga May 29 '19 at 23:39
  • The first case likely has to do with the quick object identity test that sets perform: identity is checked first because that's cheap; if the objects are identical, equality is assumed, otherwise the potentially expensive equality check runs. When you just use `np.nan` three times in a list, that is three references to the same object. When you put those into a series, it creates a `numpy` array of `float64` dtype, and when you iterate over it again, it produces 3 distinct `nan` `float` objects. – juanpa.arrivillaga May 29 '19 at 23:43
  • @juanpa.arrivillaga `float('nan')` produces the 2nd case. Pretty weird. – gmds May 29 '19 at 23:45
  • Similar question on `nan` and `set` from 4 days ago: https://stackoverflow.com/questions/56290015/inconsistent-behavior-of-nans-in-python-numpy-pandas – hpaulj May 29 '19 at 23:51
  • @gmds yes, because that will create a new object every time. – juanpa.arrivillaga May 29 '19 at 23:53
  • Thanks! Interesting. For any future interested parties, this can be avoided by using `df.drop_duplicates()`. – Michael Dunne May 30 '19 at 14:21
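
The identity-versus-equality explanation in the comments can be reproduced without NumPy or pandas. The following is only a minimal sketch using plain `float('nan')` objects. It assumes CPython's behaviour of checking identity before equality in set lookups, and `dedupe_nan` is a hypothetical helper introduced for illustration, not part of any library.

```python
import math

nan = float("nan")

# NaN is never equal to itself, but it is of course identical to itself.
print(nan == nan)  # False
print(nan is nan)  # True

# Three references to the SAME object: the set's identity shortcut
# treats them as one element.
same_object = [nan, nan, nan]
print(len(set(same_object)))  # 1

# Three DISTINCT NaN objects (analogous to iterating over a float64
# Series): identity differs and equality is False, so all three stay.
distinct_objects = [float("nan") for _ in range(3)]
print(len(set(distinct_objects)))  # 3


def dedupe_nan(values):
    """Hypothetical helper: order-preserving dedupe that treats all NaNs as one value."""
    seen = set()
    result = []
    nan_seen = False
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            # Keep only the first NaN encountered.
            if not nan_seen:
                nan_seen = True
                result.append(v)
        elif v not in seen:
            seen.add(v)
            result.append(v)
    return result


deduped = dedupe_nan(distinct_objects + [1.0, 1.0])
print(len(deduped))  # 2
```

With pandas data, staying inside pandas (for example `drop_duplicates()`, as mentioned in the last comment) avoids the round trip through distinct Python `float` objects altogether.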
