
Why does the value None convert to both True and False in this series?

Env:

  • Jupyter Notebook 6.0.3 in JupyterLab
  • Python 3.7.6

Imports:

from IPython.display import display
import pandas as pd

Converts None to True:

df_test1 = pd.DataFrame({'test_column':[0,1,None]})
df_test1['test_column'] = df_test1.test_column.astype(bool)
display(df_test1)

(screenshot: df_test1 displays test_column as False, True, True — None became True)

Converts None to False:

df_test2 = pd.DataFrame({'test_column':[0,1,None,'test']})
df_test2['test_column'] = df_test2.test_column.astype(bool)
display(df_test2)

(screenshot: df_test2 displays test_column as False, True, False, True — None became False)

Is this expected behavior?

blehman
  • Your first case is effectively `bool(np.nan)`, whereas your second is `bool(None)`; compare `df_test1['test_column'].apply(type)` with `df_test2['test_column'].apply(type)` – anky Feb 05 '21 at 16:56
  • Good read: [What is the difference between NaN and None?](https://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none) – anky Feb 05 '21 at 17:01

2 Answers


Yes, this is expected behaviour; it follows from the dtype (storage type) initially chosen for each series (column). The first input results in a series of floating point numbers; the second contains references to Python objects:

>>> pd.Series([0,1,None]).dtype
dtype('float64')
>>> pd.Series([0,1,None,'test']).dtype
dtype('O')

The float version of None is NaN, or Not a Number, which converts to True when interpreted as a boolean (as it is not equal to 0):

>>> pd.Series([0,1,None])[2]
nan
>>> bool(pd.Series([0,1,None])[2])
True

In the other case, the original None object was preserved, which converts to False:

>>> pd.Series([0,1,None,'test'])[2] is None
True
>>> bool(None)
False

So this comes down to automatic type inference: what type Pandas thinks is best suited for each column; see the DataFrame.infer_objects() method. The goal is to minimise storage requirements and maximise operation performance; storing numbers as native 64-bit floating point values leads to faster numeric operations and a smaller memory footprint, while still being able to represent 'missing' values as NaN.
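
A small sketch of that storage difference (the variable names are illustrative, not from the question): the same values end up in a compact float64 array or as Python object references, and `memory_usage(deep=True)` makes the cost visible:

```python
import pandas as pd

# Same leading values, two different storage strategies picked by type inference.
s_float = pd.Series([0, 1, None])           # None promoted to NaN, stored as float64
s_object = pd.Series([0, 1, None, 'test'])  # mixed types force Python object references

print(s_float.dtype)   # float64
print(s_object.dtype)  # object

# deep=True also counts the Python objects an object column points at
print(s_float.memory_usage(deep=True))
print(s_object.memory_usage(deep=True))
```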

However, when you pass in a mix of numbers and strings, Pandas can't use a dedicated specialised array type and so falls back to the "Python object" dtype, which stores references to the original Python objects.
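
One way to see that the original objects survive in an object column (as anky's comment also suggests) is to map each element to its type:

```python
import pandas as pd

s = pd.Series([0, 1, None, 'test'])  # object dtype: elements keep their Python types
print(s.apply(type).tolist())
# each element is still the int / NoneType / str it started out as
```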

Instead of letting Pandas guess what type you need, you can explicitly specify the dtype. You could use one of the nullable integer types (which use pd.NA instead of NaN); converting these to booleans results in missing values converting to False:

>>> pd.Series([0,1,None], dtype=pd.Int64Dtype()).astype(bool)
0    False
1     True
2    False
dtype: bool
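
The same nullable dtype can also be requested with its string alias "Int64" (note the capital "I"); a quick sketch:

```python
import pandas as pd

s = pd.Series([0, 1, None], dtype="Int64")  # string alias for pd.Int64Dtype()
print(s)
# the missing entry is shown as <NA>, not NaN
print(s.isna().tolist())  # [False, False, True]
```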

Another option is to convert to a nullable boolean type, and so preserve the None / NaN indicators of missing data:

>>> pd.Series([0,1,None]).astype("boolean")
0    False
1     True
2     <NA>
dtype: boolean
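
Keeping the missing marker around lets you decide explicitly, later on, how missing entries should be treated, rather than having the cast pick True or False for you; for example:

```python
import pandas as pd

s = pd.Series([0, 1, None]).astype("boolean")  # missing value stays <NA>
# make the choice for missing data explicit instead of implicit
print(s.fillna(False).tolist())  # [False, True, False]
```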

Also see Working with missing data section in the user manual, as well as the nullable integer and nullable boolean data type manual pages.

Note that the Pandas notion of the NA value, representing missing data, is still considered experimental, which is why it is not yet the default. But if you want to 'opt in' for dataframes you just created, you can call the DataFrame.convert_dtypes() method right after creating the frame:

>>> df = pd.DataFrame({'prime_member':[0,1,None]}).convert_dtypes()
>>> df.prime_member
0       0
1       1
2    <NA>
Name: prime_member, dtype: Int64
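
From that nullable Int64 column you can then move to nullable booleans without losing the missing marker; a sketch under the same setup:

```python
import pandas as pd

df = pd.DataFrame({'prime_member': [0, 1, None]}).convert_dtypes()
print(df.prime_member.dtype)  # Int64

# converting to the nullable boolean dtype keeps the missing value missing
flags = df.prime_member.astype("boolean")
print(flags.isna().tolist())  # [False, False, True]
```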
Martijn Pieters

The difference is the datatypes: df_test1.test_column.dtype is float64, so you don't have None but NaN, and bool(np.nan) is True.

However, when you have a mixed-type column, df_test2.test_column.dtype is object. Then None remains None in the data, and bool(None) is False.
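
A quick check of both claims (variable names are illustrative):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1, None])           # float64: None became np.nan
assert np.isnan(s1[2]) and bool(s1[2]) is True

s2 = pd.Series([0, 1, None, 'test'])   # object: None is still None
assert s2[2] is None and bool(s2[2]) is False

print(s1.astype(bool).tolist())  # [False, True, True]
print(s2.astype(bool).tolist())  # [False, True, False, True]
```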

Quang Hoang