
So I found out that the float NaN apparently doesn't equal itself. A quick interactive check confirms it:
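>>> import numpy as np
>>> import pandas as pd
>>> np.nan == np.nan
False

My question is how to deal with this. Let's start with a dataframe: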

DF = pd.DataFrame({'X':[0, 3, None]})
DF
      X
0   0.0
1   3.0
2   NaN

DF['test1'] = np.where(DF['X'] == np.nan, 1, 0)
DF['test2'] = np.where(DF['X'].isin([np.nan]), 1, 0)
DF
      X test1 test2
0   0.0     0     0
1   3.0     0     0
2   NaN     0     1

So test1 and test2 aren't the same: the == np.nan comparison never matches anything (since NaN != NaN), while isin([np.nan]) does flag the NaN row. Many others have mentioned that we should use pd.isnull(). My question is: is it safe to just use isin()? For example, if I need to create a new column using np.where, can I simply do:

DF['test3'] = np.where(DF['X'].isin([0, np.nan]), 1, 0)

Or should I always use pd.isnull like so:

DF['test3'] = np.where((DF['X'] == 0) | (pd.isnull(DF['X'])), 1, 0)
AdmiralWen
  • The latter is usually marginally faster. – Zero Sep 19 '17 at 16:53
  • The easiest way to know would be to use [`timeit`](https://docs.python.org/3/library/timeit.html). In my contrived case of a DF with 10000 entries, your first method is ~10% faster, but it might be affected by several factors specific to your problem; see the sketch below. – roganjosh Sep 19 '17 at 17:35
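A minimal sketch of the timeit comparison that comment describes (the 10,000-row frame and the repetition count are my own illustrative choices, not from the thread):

import timeit

import numpy as np
import pandas as pd

# Illustrative frame with zeros, other values, and NaNs
df = pd.DataFrame({'X': np.random.choice([0.0, 3.0, np.nan], size=10000)})

# Variant 1: isin with np.nan in the lookup list
t_isin = timeit.timeit(
    lambda: np.where(df['X'].isin([0, np.nan]), 1, 0), number=1000)

# Variant 2: explicit equality combined with pd.isnull
t_isnull = timeit.timeit(
    lambda: np.where((df['X'] == 0) | (pd.isnull(df['X'])), 1, 0), number=1000)

print(t_isin, t_isnull)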

1 Answer


You should always use pd.isnull or np.isnan if you suspect there could be NaNs.

For example, suppose you have an object-dtype column (unfortunately, these aren't uncommon):

     X
0    a
1    3
2  NaN
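For concreteness, one way to build such a frame (this particular construction is my illustration, not from the answer):

>>> import numpy as np, pandas as pd
>>> df = pd.DataFrame({'X': ['a', 3, np.nan]})
>>> df['X'].dtype  # the mixed strings and numbers force object dtype
dtype('O')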

Then using isin won't give you correct results:

>>> df['X'].isin([np.nan])
0    False
1    False
2    False
Name: X, dtype: bool

While isnull still works correctly:

>>> df['X'].isnull()
0    False
1    False
2     True
Name: X, dtype: bool

Given that NaN support isn't explicitly mentioned in either Series.isin or DataFrame.isin, it might just be an implementation detail that it correctly "finds" NaNs. And implementation details are always bad to rely on: they could change at any time.

Aside from that, it always pays to be explicit: an explicit isnull or isnan check should be (in my opinion) preferred.

MSeifert