I am trying to select the rows of df where the column label has the value None. (The value is a genuine None that I obtained from another function, not NaN.)

Why does df[df['label'].isnull()] return the rows I wanted,

but df[df['label'] == None] returns an empty DataFrame (Empty DataFrame, Columns: [path, fanId, label, gain, order], Index: [])?

juanpa.arrivillaga
raffa

1 Answer

As the comment above states, missing data in pandas is represented by NaN, which is a numerical value of type float. None, however, is Python's NoneType singleton, so NaN is not equal to None.

In [27]: np.nan == None
Out[27]: False

In this Github thread they discuss further, noting:

This was done quite a while ago to make the behavior of nulls consistent, in that they don't compare equal. This puts None and np.nan on an equal (though not-consistent with python, BUT consistent with numpy) footing.

This means that when you do df[df['label'] == None], pandas compares elementwise and treats each None as a missing value, so every check behaves like np.nan == np.nan, which we know is False.

In [63]: np.nan == np.nan
Out[63]: False
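To make the difference concrete, here is a minimal sketch. The DataFrame is hypothetical (only the path and label columns from the question, with made-up values), and the label column is forced to object dtype so that the None values are stored as-is.

```python
import pandas as pd

# Hypothetical stand-in for the asker's df; values are made up for illustration.
df = pd.DataFrame({
    "path": ["a.wav", "b.wav", "c.wav"],
    "label": pd.Series(["x", None, None], dtype=object),
})

# Elementwise == against None: nulls never compare equal, so nothing matches.
eq_mask = df["label"] == None  # noqa: E711 (deliberate, to show the pitfall)

# isnull() is the dedicated missing-value check and flags the None rows.
null_mask = df["label"].isnull()

print(eq_mask.tolist())
print(null_mask.tolist())
print(df[null_mask])
```

Here eq_mask is False everywhere, while null_mask selects the two rows whose label is None.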

Additionally, you should not use df[df['label'] == None] for Boolean indexing. Comparing against None with == is not best practice, as PEP 8 mentions:

Comparisons to singletons like None should always be done with is or is not, never the equality operators.

For example, you could do tst.value.apply(lambda x: x is None), which yields the same outcome as .isnull() and illustrates how pandas treats these None values as NaNs. Note that this applies to the tst DataFrame example below, where tst.value has dtype object and I've explicitly specified the None elements.
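As a rough sketch of that comparison (the two Series are standalone stand-ins for the tst columns, and dtype=object is set explicitly as an assumption so that None is kept), the identity check only agrees with .isnull() on the object column:

```python
import pandas as pd

# Assumed stand-ins for the two tst columns.
value = pd.Series(["A", "B", None, "C", None], dtype=object)  # None is kept
label = pd.Series([1, 2, None, 3, None])                      # None becomes NaN

# Object column: "is None" and .isnull() flag the same rows.
value_is_none = value.apply(lambda x: x is None)
value_is_null = value.isnull()
print(value_is_none.equals(value_is_null))

# Float column: the None was converted to NaN, so "is None" finds nothing,
# while .isnull() still reports the missing rows.
label_is_none = label.apply(lambda x: x is None)
label_is_null = label.isnull()
print(label_is_none.tolist())
print(label_is_null.tolist())
```

This is why .isnull() is the safer choice in practice: it works regardless of whether the missing value ended up stored as None or as NaN.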

There is a nice example in the pandas docs which illustrates this and its effect.

For example, if you have two columns, one of type float and the other of type object, you can see how pandas deals with None in a nice way. Notice that in the float column it is stored as NaN.

In [32]: tst = pd.DataFrame({"label" : [1, 2, None, 3, None], "value" : ["A", "B", None, "C", None]})

In [39]: tst
Out[39]:
   label value
0    1.0     A
1    2.0     B
2    NaN  None
3    3.0     C
4    NaN  None

In [51]: type(tst.value[2])
Out[51]: NoneType

In [52]: type(tst.label[2])
Out[52]: numpy.float64
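The same check can be done at the level of single elements. The sketch below rebuilds the two columns as standalone Series (an assumption for brevity, with dtype=object set explicitly on the string data) and uses pd.isna, which accepts both None and NaN:

```python
import numpy as np
import pandas as pd

label = pd.Series([1, 2, None, 3, None])                      # stored as float64
value = pd.Series(["A", "B", None, "C", None], dtype=object)  # stored as object

print(label.dtype, value.dtype)

# The missing entry is a float NaN in one column and a real None in the other.
print(np.isnan(label[2]), value[2] is None)

# NaN does not even compare equal to itself.
print(label[2] == label[2])

# pd.isna treats both kinds of missing value the same way.
print(pd.isna(label[2]), pd.isna(value[2]))
```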

This post explains the difference between NaN and None really well; I would definitely take a look at it.

RK1
  • Preferred is `.isnull()`. `==` does not work because in `pandas` it is equivalent to comparing `np.nan == np.nan`, which returns *False*. In my example above, if you did `tst[tst.value == None]` you would get an empty DataFrame – RK1 Oct 02 '19 at 17:18
  • Yes, I understand that, but *you stated* that `tst.value.apply(lambda x: x is None)` is a better alternative to `df['label'] == None`, which is definitely not true. – juanpa.arrivillaga Oct 02 '19 at 17:22
  • No, I was illustrating what good practice is when comparing singletons, and then gave an example. I said *could*, not *should*, when doing the comparison in `pandas`... – RK1 Oct 02 '19 at 17:25
  • But **why would you do that**? It would be *worse* than using `==`. – juanpa.arrivillaga Oct 02 '19 at 17:26
  • It's an illustration, I'm not saying you should do it, rather to show how the `NoneTypes` exist in the `pandas.DataFrame` however when doing a vectorized comparison it's going to yield the same result as `np.nan == np.nan` as that's how pandas treats the `None` – RK1 Oct 02 '19 at 17:31
  • Ahhhh yes. I see what you were trying to illustrate here, subtle point. – juanpa.arrivillaga Oct 02 '19 at 17:32