Python's 'set' operator doesn't work with numpy.nan

Question

I noticed a problem converting lists of NaN values to sets:

import pandas as pd
import numpy as np

x = pd.DataFrame({'a':[None,None]})
x_numeric = pd.to_numeric(x['a']) #converts to numpy.float64
set(x_numeric)

This SHOULD return {nan} but instead returns {nan, nan}. However, doing this:

set([np.nan, np.nan])

returns the expected {nan}. The former are apparently class numpy.float64, while the latter are by default class float.

Any idea why set() doesn't work with numpy.float64 NaN values? I'm using Pandas version 0.18 and Numpy version 1.10.4.

In numpy two NaNs are not equal. In a list they may be the identical object, but not in a numpy array. To find out, try `set(np.array([np.nan,np.nan]))`. In pandas the series will be in numpy array format — Bharath M Shetty, Oct 20 '17 at 05:22
`x_numeric.unique()` returns only `[nan]`, this is interesting. — cs95, Oct 20 '17 at 05:23
@cᴏʟᴅsᴘᴇᴇᴅ That fixes my immediate problem, thanks! Oddly np.unique(x_numeric) still returns {nan, nan}. — tom, Oct 20 '17 at 05:34
@tom Glad I could help. Unfortunately, I don't know the reason for it, so I'm not posting an answer. — cs95, Oct 20 '17 at 05:54
@Bharathshetty reason is optimization of set, it first checks the id, rather than for equality, see my answer (though I guess I could add some pseudocode to explain what it is set does here). — Andy Hayden, Oct 20 '17 at 06:37
@cᴏʟᴅsᴘᴇᴇᴅ my suspicion is that .unique(), written in cython, (correctly) doesn't "care" about the contents of the bytes when doing the uniqueness (i.e. sees NaN no different from any other float64) — Andy Hayden, Oct 20 '17 at 06:38
@AndyHayden I see! Thanks for the answer as well, it was very informative. See if you can answer [mine](https://stackoverflow.com/questions/46842793/datetime-conversion-how-to-extract-the-inferred-format) too.. :-) — cs95, Oct 20 '17 at 06:41
Yeah even I want to know the answer for your qn @cᴏʟᴅsᴘᴇᴇᴅ. — Bharath M Shetty, Oct 20 '17 at 06:45
@Eric It feels a bit of a shame to dupe hammer it, as that question isn't better answered (I know that's not really the criteria, but it is the outcome: low rep users/not logged in are redirected there and will never see this page). I have a feeling there is a much earlier original dupe, but I couldn't find (either) before. — Andy Hayden, Oct 20 '17 at 17:01
Wasn't aware that low rep users never saw dupes. I could try flipping the dupe hammer, if you think that would be better? Also, while not better _answered_, it is better _asked_, as it takes `pandas` out of the loop - so maybe you should just post a better answer there — Eric, Oct 20 '17 at 17:04

score 7 · Accepted Answer · answered Oct 20 '17 at 06:32

NaNs in a float64 array don't point to the same space in memory as np.nan; like every other number in the array, each one occupies its own 8 bytes in the array. We can see this when we take the id:

In [11]: x_numeric
Out[11]:
0   NaN
1   NaN
Name: a, dtype: float64

In [12]: x_numeric.apply(id)
Out[12]:
0    4657312584
1    4657312536
Name: a, dtype: int64

In [13]: id(np.nan)
Out[13]: 4535176264

In [14]: id(np.nan)
Out[14]: 4535176264
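
The same effect can be reproduced on a plain numpy array, without pandas. This is a minimal sketch; the variable names are only illustrative. Reading an element out of a float64 array wraps the raw bytes in a new numpy.float64 object, whereas np.nan is a single float object that is reused every time the name is looked up:

```python
import numpy as np

arr = np.array([np.nan, np.nan])  # two NaNs stored as raw 8-byte floats

# Each element access wraps the raw bytes in a fresh numpy.float64 object.
first, second = arr[0], arr[1]

distinct_objects = first is not second  # two separate Python objects
same_singleton = np.nan is np.nan       # one module-level float, reused

# A set built from the array keeps both NaNs; a set built from the
# shared np.nan object keeps only one.
from_array = set(arr)
from_singleton = set([np.nan, np.nan])
```

Here `from_array` ends up with two elements and `from_singleton` with one, which matches the behaviour described in the question.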

It's kind of a Python "gotcha" that this occurs, since it's an optimization: before checking equality, a set checks whether it's the same object (has the same id / location in memory):

In [21]: s = set([np.nan])

In [22]: np.nan in s
Out[22]: True

In [23]: x_numeric.apply(lambda x: x in s)
Out[23]:
0    False
1    False
Name: a, dtype: bool

The reason it's a "gotcha" is that NaN, unlike most objects, is not equal to itself:

In [24]: np.nan == np.nan
Out[24]: False
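
Roughly, the membership test behaves like the sketch below. This is a simplified model, not CPython's actual implementation: `contains` is a hypothetical helper, and it ignores the hash comparison that a real set performs before it compares elements at all.

```python
def contains(items, candidate):
    """Rough model of how a container decides whether it holds `candidate`."""
    for item in items:
        # Identity is tested first; == is only consulted when it fails.
        if item is candidate or item == candidate:
            return True
    return False


nan_a = float('nan')
nan_b = float('nan')  # a second, distinct NaN object

same_object = contains([nan_a], nan_a)   # identity short-circuits to True
other_object = contains([nan_a], nan_b)  # not the same object, and NaN != NaN
```

With the same object the identity test succeeds, so the failing `==` is never reached; with a different NaN object both tests fail, so the set treats it as a new element.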

score 2 · Answer · Eric · 2017-10-20T07:30:32.537

Numpy is a red herring here - np.nan is just a name for float('nan'), which shows the same problem:

>>> a = float('nan')
>>> b = float('nan')
>>> {a, b}
{nan, nan}
>>> {a, a}
{nan}

As Andy says, this is about set membership trying x is y before x == y.
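
Given that behaviour, one possible workaround (not taken from either answer) is to map every NaN onto a single shared object before building the set, so that the identity check succeeds. The helper name `unique_with_single_nan` is hypothetical, and the sketch assumes every input value is numeric:

```python
import math


def unique_with_single_nan(values):
    """Hypothetical helper: deduplicate floats, treating all NaNs as one value."""
    canonical_nan = float('nan')  # one shared object, so `x is y` succeeds
    return {canonical_nan if math.isnan(v) else v for v in values}


result = unique_with_single_nan([float('nan'), 1.0, float('nan'), 1.0])
```

`result` then holds 1.0 and a single NaN. For a pandas Series, `x_numeric.unique()` from the comments above achieves a similar result without a helper.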

edited Oct 20 '17 at 07:30

answered Oct 20 '17 at 07:23

Eric
