
I noticed a problem converting lists of NaN values to sets:

import pandas as pd
import numpy as np

x = pd.DataFrame({'a':[None,None]})
x_numeric = pd.to_numeric(x['a']) #converts to numpy.float64
set(x_numeric)

This SHOULD return {nan} but instead returns {nan, nan}. However, doing this:

set([np.nan, np.nan])

returns the expected {nan}. The former are apparently class numpy.float64, while the latter are by default class float.

Any idea why set() doesn't work with numpy.float64 NaN values? I'm using Pandas version 0.18 and Numpy version 1.10.4.

tom
  • In NumPy, two NaNs are not equal. In a list they may be identical objects, but not in a NumPy array. To see this, try `set(np.array([np.nan,np.nan]))`. In pandas, the Series will be stored in NumPy array format – Bharath M Shetty Oct 20 '17 at 05:22
  • `x_numeric.unique()` returns only `[nan]`, this is interesting. – cs95 Oct 20 '17 at 05:23
  • Well, I'm a bit more confused now. – Bharath M Shetty Oct 20 '17 at 05:24
  • @cᴏʟᴅsᴘᴇᴇᴅ That fixes my immediate problem, thanks! Oddly np.unique(x_numeric) still returns {nan, nan}. – tom Oct 20 '17 at 05:34
  • @tom Glad I could help. Unfortunately, I don't know the reason for it, so I'm not posting an answer. – cs95 Oct 20 '17 at 05:54
  • @Bharathshetty reason is optimization of set, it first checks the id, rather than for equality, see my answer (though I guess I could add some pseudocode to explain what it is set does here). – Andy Hayden Oct 20 '17 at 06:37
  • @cᴏʟᴅsᴘᴇᴇᴅ my suspicion is that .unique(), written in cython, (correctly) doesn't "care" about the contents of the bytes when doing the uniqueness (i.e. sees NaN no different from any other float64) – Andy Hayden Oct 20 '17 at 06:38
  • @AndyHayden I see! Thanks for the answer as well, it was very informative. See if you can answer [mine](https://stackoverflow.com/questions/46842793/datetime-conversion-how-to-extract-the-inferred-format) too.. :-) – cs95 Oct 20 '17 at 06:41
  • Yeah even I want to know the answer for your qn @cᴏʟᴅsᴘᴇᴇᴅ. – Bharath M Shetty Oct 20 '17 at 06:45
  • Also related: https://stackoverflow.com/q/45300367/102441 – Eric Oct 20 '17 at 07:29
  • @Eric It feels a bit of a shame to dupe hammer it, as that question isn't better answered (I know that's not really the criteria, but it is the outcome: low rep users/not logged in are redirected there and will never see this page). I have a feeling there is a much earlier original dupe, but I couldn't find (either) before. – Andy Hayden Oct 20 '17 at 17:01
  • Wasn't aware that low rep users never saw dupes. I could try flipping the dupe hammer, if you think that would be better? Also, while not better _answered_, it is better _asked_, as it takes `pandas` out of the loop - so maybe you should just post a better answer there – Eric Oct 20 '17 at 17:04

2 Answers


NaNs in a float64 array are not the same object in memory as np.nan. Like every other number in the array, each one occupies its own 8 bytes in the array. We can see this when we take the id:

In [11]: x_numeric
Out[11]:
0   NaN
1   NaN
Name: a, dtype: float64

In [12]: x_numeric.apply(id)
Out[12]:
0    4657312584
1    4657312536
Name: a, dtype: int64

In [13]: id(np.nan)
Out[13]: 4535176264

In [14]: id(np.nan)
Out[14]: 4535176264

This is something of a Python "gotcha". It comes from an optimization: before comparing elements with ==, a set first checks whether they are the same object (same id, i.e. the same location in memory):

In [21]: s = set([np.nan])

In [22]: np.nan in s
Out[22]: True

In [23]: x_numeric.apply(lambda x: x in s)
Out[23]:
0    False
1    False
Name: a, dtype: bool

The reason it's a "gotcha" is that NaN, unlike most objects, is not equal to itself:

In [24]: np.nan == np.nan
Out[24]: False
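
The identity-then-equality check described above can be sketched in plain Python. This is only an illustration, not CPython's actual implementation (which is written in C and also compares hashes first); `same_element` is a hypothetical helper name, and plain `float('nan')` objects stand in for the numpy.float64 values:

```python
def same_element(x, y):
    # Hypothetical helper: mimics how a set decides whether two
    # elements match. Identity is tried first, then equality.
    return x is y or x == y


nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object

print(same_element(nan_a, nan_a))  # True: same object, so == is never consulted
print(same_element(nan_a, nan_b))  # False: different objects, and NaN != NaN

print(len({nan_a, nan_a}))  # 1
print(len({nan_a, nan_b}))  # 2
```

The first case is why `set([np.nan, np.nan])` collapses to one element: both entries are the very same object. The second case matches what happens with values read out of the float64 Series.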
Andy Hayden

Numpy is a red herring here - np.nan is just a name for float('nan'), which shows the same problem:

>>> a = float('nan')
>>> b = float('nan')
>>> {a, b}
{nan, nan}
>>> {a, a}
{nan}

As Andy says, this comes down to set membership checks trying x is y before x == y.
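
If the goal is simply a set with at most one NaN, one possible workaround that follows from this is to replace every NaN with a single shared object before building the set, so the identity check succeeds. This is a minimal sketch under that assumption; `unique_with_single_nan` is a hypothetical helper, not a pandas or NumPy function:

```python
import math


def unique_with_single_nan(values):
    # Hypothetical helper: map every float NaN to one canonical object,
    # so the set's identity check treats them all as the same element.
    canonical_nan = float("nan")
    return {
        canonical_nan if isinstance(v, float) and math.isnan(v) else v
        for v in values
    }


data = [float("nan"), float("nan"), 1.0]  # two distinct NaN objects
print(len(set(data)))                     # 3
print(len(unique_with_single_nan(data)))  # 2
```

Since numpy.float64 is a subclass of Python's float, the `isinstance` check should also cover values taken from a float64 Series. For pandas data specifically, the `x_numeric.unique()` call mentioned in the comments is the more direct route.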

Eric