Weird null checking behaviour by pd.notnull

Question

This is essentially a rehashing of the content of my answer here.

I came across some weird behaviour when trying to solve this question, using pd.notnull.

Consider

x = ('A4', np.nan)

I want to check which of these items are null. Using np.isnan directly will throw a TypeError (but I've figured out how to solve that).

Using pd.notnull does not work.

>>> pd.notnull(x)
True

It treats the tuple as a single value (rather than an iterable of values). Furthermore, converting this to a list and then testing also gives an incorrect answer.

>>> pd.notnull(list(x))
array([ True,  True])

Since the second value is nan, the result I'm looking for should be [True, False]. It finally works when you pre-convert to a Series:

>>> pd.Series(x).notnull() 
0     True
1    False
dtype: bool

So, the solution is to Series-ify it and then test the values.

Along similar lines, another (admittedly roundabout) solution is to pre-convert to an object dtype numpy array, after which pd.notnull works directly (np.isnan still raises a TypeError on an object array):

>>> pd.notnull(np.array(x, dtype=object))
array([ True, False])
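
If a check is needed that does not depend on how pandas converts the container, one option is to test each element separately. This is only a sketch; `notnull_each` is a hypothetical helper name, not part of pandas:

```python
import numpy as np
import pandas as pd

def notnull_each(values):
    """Hypothetical helper: call pd.notnull on each element on its own."""
    # Each element is passed as a scalar, so no array conversion
    # (and no accidental cast of NaN to the string 'nan') takes place.
    return [bool(pd.notnull(v)) for v in values]

x = ('A4', np.nan)
result = notnull_each(x)
print(result)
```

This gives up vectorisation, so it only makes sense for small inputs like the tuple above.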

I imagine that pd.notnull directly converts x to a string array under the covers, rendering the NaN as a string "nan", so it is no longer a "null" value.

Is pd.notnull doing the same thing here? Or is there something else going on under the covers that I should be aware of?

Notes

In [156]: pd.__version__
Out[156]: '0.22.0'

What version of `pandas` do you use? In v0.23.0 `pd.notnull(list(x))` returns the correct result: `array([ True, False])` — Grigoriy Mikhalkin, Jun 26 '18 at 06:15
@GrigoriyMikhalkin 0.22. If this problem does not exist on 0.23, then it's certainly a bug that was fixed. Interesting. — cs95, Jun 26 '18 at 06:16
Tested in `pandas 0.23.1` - `x = ('A4', np.nan) print(pd.notnull(list(x)))` returns `[ True False]` — jezrael, Jun 26 '18 at 06:22
@coldspeed seems like this is the issue related to this behavior: https://github.com/pandas-dev/pandas/issues/20675 — Grigoriy Mikhalkin, Jun 26 '18 at 06:34
@GrigoriyMikhalkin If you're up to it, write a few words as an answer and I'll be happy to mark it. — cs95, Jun 26 '18 at 06:49

score 3 · Accepted Answer · answered Jun 26 '18 at 07:14

Here is the issue related to this behavior: https://github.com/pandas-dev/pandas/issues/20675.

In short, if the argument passed to notnull is a list, it is converted internally to a NumPy array with np.asarray. The bug occurred because, when no dtype is specified, NumPy converts np.nan to a string (which pd.isnull does not recognize as a null value):

import numpy as np

a = ['A4', np.nan]
np.asarray(a)
# array(['A4', 'nan'], dtype='<U3')

This problem was fixed in version 0.23.0, by calling np.asarray with dtype=object.
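
To see the difference between the two conversions side by side, here is a minimal sketch that performs both by hand and passes each result to pd.notnull. The variable names `as_str` and `as_obj` are only illustrative; this imitates the conversion step described above and is not the pandas source:

```python
import numpy as np
import pandas as pd

a = ['A4', np.nan]

# Default conversion: NumPy picks a common string dtype,
# so NaN is stored as the text 'nan'.
as_str = np.asarray(a)

# Explicit object dtype: the elements are stored unchanged,
# so NaN remains a float.
as_obj = np.asarray(a, dtype=object)

print(as_str.dtype.kind, repr(as_str[1]))
print(pd.notnull(as_str))  # the string 'nan' is not treated as null
print(pd.notnull(as_obj))  # the float NaN is detected as null
```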
