
So I found out that the float NaN apparently doesn't equal itself. A quick interactive check confirms it:
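>>> import numpy as np
>>> import pandas as pd
>>> np.nan == np.nan
False

My question is how to deal with this. Let's start with a dataframe: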

DF = pd.DataFrame({'X':[0, 3, None]})
DF
      X
0   0.0
1   3.0
2   NaN

DF['test1'] = np.where(DF['X'] == np.nan, 1, 0)
DF['test2'] = np.where(DF['X'].isin([np.nan]), 1, 0)
DF
      X test1 test2
0   0.0     0     0
1   3.0     0     0
2   NaN     0     1

So test1 and test2 aren't the same: the == np.nan comparison never matches anything (since NaN != NaN), while isin([np.nan]) does flag the NaN row. Many others have mentioned that we should use pd.isnull(). My question is: is it safe to just use isin()? For example, if I need to create a new column using np.where, can I simply do:

DF['test3'] = np.where(DF['X'].isin([0, np.nan]), 1, 0)

Or should I always use pd.isnull like so:

DF['test3'] = np.where((DF['X'] == 0) | (pd.isnull(DF['X'])), 1, 0)
AdmiralWen
  • The latter is usually marginally faster. – Zero Sep 19 '17 at 16:53
  • The easiest way to know would be to use [`timeit`](https://docs.python.org/3/library/timeit.html). In my contrived case of a DF with 10000 entries, your first method is ~10% faster, but it might be affected by several factors specific to your problem; see the sketch below. – roganjosh Sep 19 '17 at 17:35
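A minimal sketch of the timeit comparison that comment describes (the 10,000-row frame and the repetition count are my own illustrative choices, not from the thread):

import timeit

import numpy as np
import pandas as pd

# Illustrative frame with zeros, other values, and NaNs
df = pd.DataFrame({'X': np.random.choice([0.0, 3.0, np.nan], size=10000)})

# Variant 1: isin with np.nan in the lookup list
t_isin = timeit.timeit(
    lambda: np.where(df['X'].isin([0, np.nan]), 1, 0), number=1000)

# Variant 2: explicit equality combined with pd.isnull
t_isnull = timeit.timeit(
    lambda: np.where((df['X'] == 0) | (pd.isnull(df['X'])), 1, 0), number=1000)

print(t_isin, t_isnull)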

1 Answer


You should always use pd.isnull or np.isnan if you suspect there could be NaNs.

For example, suppose you have an object-dtype column (unfortunately, these aren't uncommon):

     X
0    a
1    3
2  NaN
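For concreteness, one way to build such a frame (this particular construction is my illustration, not from the answer):

>>> import numpy as np, pandas as pd
>>> df = pd.DataFrame({'X': ['a', 3, np.nan]})
>>> df['X'].dtype  # the mixed strings and numbers force object dtype
dtype('O')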

Then using isin won't give you correct results:

>>> df['X'].isin([np.nan])
0    False
1    False
2    False
Name: X, dtype: bool

While isnull still works correctly:

>>> df['X'].isnull()
0    False
1    False
2     True
Name: X, dtype: bool

Given that NaN support isn't explicitly mentioned in either Series.isin or DataFrame.isin, it might just be an implementation detail that it correctly "finds" NaNs. And implementation details are always bad to rely on: they could change at any time.

Aside from that, it always pays to be explicit: an explicit isnull or isnan check should be (in my opinion) preferred.

MSeifert