
Consider the following pandas.Series:

import pandas as pd
import numpy as np
s = pd.Series([np.nan, 1, 1, np.nan])

s
0    NaN
1    1.0
2    1.0
3    NaN
dtype: float64

I want to find only unique values in this particular series using the built-in set function:

unqs = set(s)

unqs
{nan, 1.0, nan}

Why are there duplicate NaNs in the resultant set? Using a similar function (pandas.unique) does not produce this result, so what's the difference, here?

pd.unique(s)
array([ nan,   1.])
blacksite
  • Because in Python `math.nan != math.nan`. It is one of the violations of the *reflexivity* contract an equality relation should have, but there are good reasons to do that here. – Willem Van Onsem Dec 18 '17 at 13:42
  • Is there a particular design reason for this? The result of that comparison is unintuitive. – blacksite Dec 18 '17 at 13:44
  • 1
    more than unintuitive, it even violates the reflexivity constraint. The reason is that for instance `math.nan + 2` is also `math.nan`, but can you say that `x + 2 == x`? – Willem Van Onsem Dec 18 '17 at 13:46
  • Please, when working with pandas, use `s.unique()` or `s.value_counts()`. – cs95 Dec 18 '17 at 13:48
  • 1
    I think this problem has been answered in various guises before (e.g. [here](https://stackoverflow.com/questions/41723419) and [here](https://stackoverflow.com/questions/47721635)). Basically Python creates new NaN objects when it iterates over the Series (like you point out) and Python tests for equal objects by id first, before falling back to `==`. – Alex Riley Dec 18 '17 at 13:51
  • In your last snippet `(id(x[0]), id(x[1]))` you are comparing `nan` and `1.0` because `x[0]` is `nan` and `x[1]` is `1.0`. – godaygo Dec 18 '17 at 14:05
  • @godaygo They would be different if both x[0] and x[1] referred to np.nan as well. Try with `arr = np.array([np.nan, np.nan])`, `id(arr[0])` and `id(arr[1])`. You'll get different id's. – ayhan Dec 18 '17 at 14:37

1 Answer


As in Java and JavaScript, `nan` in numpy does not equal itself.

>>> np.nan == np.nan
False

This means that when the set constructor checks "do I have an instance of nan in this set yet?", the equality test always answers False.
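You can see the identity shortcut at work with plain Python floats (a minimal sketch; the fresh `float('nan')` objects stand in for the NaNs pandas hands back while iterating over the Series):

```python
a = float('nan')
b = float('nan')

# One object added twice: the set's identity check short-circuits
# the (failing) equality test, so only one entry is kept.
print(len({a, a}))  # 1

# Two distinct nan objects: identities differ and a != b,
# so the set keeps both.
print(len({a, b}))  # 2
```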

So… why?

In both cases, `nan` represents a value that is undefined as a number, so an attempt to interpret it as an ordinary float necessarily fails. It also can't be sorted, because there's no way to tell whether `nan` is supposed to be larger or smaller than any given number.

After all, which is bigger "cat" or 7? And is "goofy" == "pluto"?
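Concretely, every comparison against `nan` answers False, including equality with itself (sketched here with `math.nan`, which behaves the same way as `np.nan`):

```python
import math

x = math.nan
print(x < 7)   # False
print(x > 7)   # False
print(x == 7)  # False
print(x == x)  # False -- even reflexive equality fails
```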

SO… what do I do?

There are a couple of ways to resolve this problem. Personally, I generally try to fill `nan` before processing: `Series.fillna` will help with that, and `Series.unique` will then give you the unique values.

no_nas = s.dropna().unique()
with_nas = s.unique()
with_replaced_nas = s.fillna(-1).unique() # using a placeholder

(Note: all of the above can be passed into the set constructor.)
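For example, with the Series `s` from the question (a quick sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, 1, np.nan])

print(set(s.dropna().unique()))  # {1.0}
# s.unique() keeps exactly one nan, so the set has two members:
print(len(set(s.unique())))      # 2
```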

What if I don't want to use the Pandas way?

There are legitimate reasons not to use Pandas, or to rely on native objects instead. In that case, the options below should suffice.

Your other option is to filter and remove the nan.

unqs = set(item for item in s if not np.isnan(item))

You could also replace things inline:

placeholder = '{placeholder}' # There are a variety of placeholder options.
unqs = set(item if not np.isnan(item) else placeholder for item in s)
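A quick sketch of the placeholder version, using a plain list of floats in place of the Series so it stands on its own:

```python
import math

values = [math.nan, 1.0, 1.0, math.nan]  # same values as the Series
placeholder = '{placeholder}'
unqs = {v if not math.isnan(v) else placeholder for v in values}
print(unqs)  # {1.0, '{placeholder}'} (set order may vary)
```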
cwallenpoole
  • 1
    `"SO... what do I do?"` No, you really should be using pandas functions that are meant for things like this. Example, `s.dropna().unique()`, or `s.value_counts().index`, and so on. – cs95 Dec 18 '17 at 14:16
  • 1
    Absolutely agree. Added the generally preferred Pandas approach, but since I don't know OP's original needs, I've added an additional section. – cwallenpoole Dec 18 '17 at 14:26
  • 1
    Looks better now :-) – cs95 Dec 18 '17 at 14:27