As in Java and JavaScript, nan in numpy does not equal itself.
>>> np.nan == np.nan
False
This means that when the set constructor checks "do I have an instance of nan in this set yet?", the equality check always returns False, so each nan value is treated as new.
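A minimal sketch of that membership check, using plain Python floats rather than numpy (np.nan is itself an ordinary Python float). The variable names are illustrative only. As far as I know, CPython's set compares by identity before falling back to ==, so the very same nan object is stored once, while separately created nan values (which is what you typically get when iterating over an array or Series) pile up.

```python
# Each call to float("nan") creates a separate object.
distinct_nans = [float("nan") for _ in range(3)]

# No nan equals another nan, so the set keeps all three.
print(len(set(distinct_nans)))  # 3

# The very same object is recognised by identity and stored only once.
same_nan = float("nan")
print(len({same_nan, same_nan, same_nan}))  # 1
```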
So… why?
nan in both cases means "value which cannot be represented by 'float'". This means an attempt to convert it to float necessarily fails. It also cannot be sorted, because there is no way to tell whether nan is supposed to be larger or smaller than any given number.
After all, which is bigger, "cat" or 7? And is "goofy" == "pluto"?
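To make the sorting point concrete, here is a small sketch with a plain Python float nan (an assumption on my part that this matches what you see with np.nan, which is the same kind of object). Every comparison comes back False, so no ordering can place it.

```python
import math

nan = float("nan")

# Every comparison involving nan is False, so there is no position
# in a sorted sequence where nan "belongs".
print(nan < 7, nan > 7, nan == 7)  # False False False

# The reliable way to detect nan is a dedicated check, not ==.
print(math.isnan(nan))  # True
```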
So… what do I do?
There are a couple of ways to resolve this problem. Personally, I generally try to fill nan before processing: DataFrame.fillna will help with that, and I would always use Series.unique() to get the unique values (unique() is a Series method, so call it on a single column rather than on the whole DataFrame).
no_nas = s.dropna().unique()
with_nas = s.unique()
with_replaced_nas = s.fillna(-1).unique() # using a placeholder
(Note: all of the above can be passed into the set constructor.)
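Here is a runnable sketch of the options above. The Series contents are made-up example data and -1 is an arbitrary placeholder; I call .tolist() before building each set only so the elements are plain Python floats.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; `s` stands in for your own Series.
s = pd.Series([1.0, np.nan, 2.0, np.nan, 1.0])

# Drop nan first, then collect the unique values into a set.
no_nas = set(s.dropna().unique().tolist())
print(sorted(no_nas))  # [1.0, 2.0]

# Replace nan with a placeholder that cannot occur in the real data.
with_replaced_nas = set(s.fillna(-1).unique().tolist())
print(sorted(with_replaced_nas))  # [-1.0, 1.0, 2.0]

# unique() already collapses the nan values into one, so only a
# single nan survives into the set.
print(len(set(s.unique())))  # 3
```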
What if I don't want to use the Pandas way?
There are reasons not to use Pandas, or to rely on native objects instead. The following should suffice.
Your other option is to filter out the nan values.
unqs = set(item for item in s if not np.isnan(item))
You could also replace things inline:
placeholder = '{placeholder}' # There are a variety of placeholder options.
unqs = set(item if not np.isnan(item) else placeholder for item in s)
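One caveat: np.isnan raises a TypeError for non-numeric items such as strings. If your data mixes types, a variant along these lines may help. This is only a sketch: is_nan and unique_values are hypothetical helper names, and the isinstance check assumes nan arrives as a Python float (np.float64 qualifies, since it subclasses float).

```python
import math

def is_nan(item):
    """Return True only for float nan; other types are never nan."""
    return isinstance(item, float) and math.isnan(item)

def unique_values(items, placeholder=None):
    """Collect unique items, dropping nan or swapping it for a placeholder.

    With placeholder=None (the default) nan values are dropped.
    """
    if placeholder is None:
        return {item for item in items if not is_nan(item)}
    return {placeholder if is_nan(item) else item for item in items}

# Made-up mixed data for illustration.
data = ["cat", float("nan"), 7, "cat", float("nan")]
print(unique_values(data) == {"cat", 7})  # True
print(unique_values(data, placeholder="missing") == {"cat", 7, "missing"})  # True
```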