
I am trying to figure out whether a column in a pandas DataFrame is boolean (and, if so, whether it has missing values and so on).

In order to test the function that I created, I tried to create a DataFrame with a boolean column that has missing values. However, it seems to me that missing values are handled in an essentially 'untyped' way in Python, and there is some weird behaviour:

> boolean = pd.Series([True, False, None])
> print(boolean)

0     True
1    False
2     None
dtype: object

So the moment you put None into the list, the Series gets dtype object, because pandas cannot store a mix of bool and type(None)=NoneType in a plain bool column (the bool dtype has no way to represent a missing value). The same thing happens with math.nan and numpy.nan. The weirdest things happen when you try to force pandas into an area it does not want to go to :-)

> boolean = pd.Series([True, False, np.nan]).astype(bool)
> print(boolean)
0     True
1    False
2     True
dtype: bool

So 'np.nan' is being cast to 'True'?

Questions:

  1. Given a data table where one column has dtype 'object' but is in fact a boolean column with missing values: how do I figure that out? After filtering for the non-missing values it still has dtype 'object'... Do I need to implement a try-catch-cast of every column into every imaginable data type in order to see the true nature of the columns?

  2. I guess that there is a logical explanation for why np.nan is being cast to True, but this is unwanted behaviour of pandas/Python itself, right? So should I file a bug report?

Fabian Werner
  • https://stackoverflow.com/questions/15686318/why-do-not-a-number-values-equal-true-when-cast-as-boolean-in-python-numpy – BENY Aug 28 '19 at 13:29
  • But what exactly are you trying to do? Do you want to cast NaN and None to False? If this dataframe is meant to store information about missing values, I would rather try to ensure that the input is correct, i.e. that it is true/false, rather than cleaning the dataframe afterwards – Grzegorz Skibinski Aug 28 '19 at 13:31
  • @GrzegorzSkibinski In this particular case (not speaking in all generality here) I want to cast boolean values to 0,1 and if there is a missing value then it should stay missing... – Fabian Werner Aug 28 '19 at 13:47
  • Ah, gotcha. How should NaN be interpreted then: as 0/1, or as NaN? – Grzegorz Skibinski Aug 28 '19 at 13:49
  • @GrzegorzSkibinski The most comfortable possibility (I think) is that there should be one single object that is integrated cleanly into the typing tree of a programming language. It should be combinable with all other simple data types and "vectors" should keep their type, i.e. [True, NA, False] should still be of type boolean. Casting this to 0/1 would then result in [1, NA, 0]. – Fabian Werner Aug 28 '19 at 13:50
  • new(ish) pandas *nullable boolean dtype* might work? e.g. `astype('boolean')`, not `'bool'`. https://pandas.pydata.org/docs/user_guide/boolean.html – fantabolous Jun 06 '23 at 02:04
  • In `R` `NA` is of type `logical` by default, so missing values are preserved when converted to ~ boolean type. This is very useful, for example, with observational data (usually full of gaps) when wanting to ascertain whether data points meet a condition, e.g. rainy days or hours. – climatestudent Sep 01 '23 at 14:40
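
To illustrate the nullable boolean dtype mentioned in the comments, here is a minimal sketch. It assumes a pandas version that ships this dtype (1.0 or later); the variable names are only for illustration. It shows an object column being converted with `astype('boolean')` and then cast to the nullable integer dtype `Int64`, so that the missing entry stays missing:

```python
import pandas as pd

# Object-dtype column: booleans mixed with a missing value.
raw = pd.Series([True, False, None])

# Nullable boolean dtype (note "boolean", not "bool"); None becomes pd.NA.
nullable = raw.astype("boolean")

# Casting to the nullable integer dtype keeps the missing entry missing.
as_int = nullable.astype("Int64")

print(nullable.dtype)  # boolean
print(as_int)
```

This is close to the [1, NA, 0] behaviour asked for above, although the user guide linked in the comment should be checked for the current status of the dtype.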

2 Answers


Q1: I would start by combining

np.any(pd.isna(boolean))

to check whether there are any missing values in the column, with

set(boolean)

to check whether the column contains only True, False and None. Combined with filtering (and, if you prefer, typecasting), that should do it.
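
The two checks can be put together roughly as follows. This is only a sketch: `summarize_boolean_column` is a hypothetical helper name, not a pandas function. It tests the types of the distinct non-missing values rather than comparing them with `{True, False}`, because `1 == True` and `0 == False` in Python, so a plain set comparison would also accept an integer column of zeros and ones:

```python
import numpy as np
import pandas as pd

def summarize_boolean_column(col):
    """Return (only_booleans, has_missing) for a Series (hypothetical helper)."""
    missing = pd.isna(col)
    has_missing = bool(missing.any())
    # Distinct non-missing values, as suggested by set(boolean).
    distinct = set(col[~missing])
    # Check types, not equality: 1 == True would fool a set comparison.
    only_booleans = len(distinct) > 0 and all(
        isinstance(v, (bool, np.bool_)) for v in distinct
    )
    return only_booleans, has_missing

print(summarize_boolean_column(pd.Series([True, False, None])))
print(summarize_boolean_column(pd.Series([0, 1, None])))
```

A column that contains only missing values is reported as not boolean here, since there is nothing to infer a type from; that is a design choice you may want to change.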

Q2: see the comment by @WeNYoBen.
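
In short, this is ordinary Python truthiness rather than a pandas bug: a float is falsy only when it equals zero, and NaN is not equal to zero. The sketch below shows this, plus one possible way (my assumption about what is wanted, based on the comments under the question) to get 0/1 while keeping the missing value, by casting to float instead of bool:

```python
import numpy as np
import pandas as pd

# Only a float equal to zero is falsy; NaN is not equal to zero.
print(bool(float("nan")))  # True
print(bool(0.0))           # False

# Casting to float instead of bool maps True/False to 1.0/0.0
# and leaves NaN as NaN.
col = pd.Series([True, False, np.nan])
as_float = col.astype(float)
print(as_float)
```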

Sosel
  • Yes, that is more or less what I have done now: first remove missing values and then do a nonsense `test = column_without_missing.map({False: False, True: True})` and if the type of test is boolean then the original type was boolean as well... – Fabian Werner Aug 28 '19 at 13:48

I've hit the same problem. I came up with the following solution:

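import numpy as np
from pandas import Series

def is_boolean_series(col: Series) -> bool:
    # Drop missing values, then require every remaining value to be a bool.
    non_missing = col.dropna()
    # An all-missing column carries no type information, so report False.
    return len(non_missing) > 0 and all(isinstance(v, (bool, np.bool_)) for v in non_missing)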
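
A possible alternative to checking values by hand is `pandas.api.types.infer_dtype`, which inspects the values and can skip missing ones. The wrapper name `is_boolean_column` below is hypothetical and the sketch is not a drop-in replacement for the function above:

```python
import pandas as pd
from pandas.api.types import infer_dtype

def is_boolean_column(col):
    """True if the non-missing values are all booleans (hypothetical helper)."""
    # skipna=True ignores None/NaN when inferring the kind of the values.
    return infer_dtype(col, skipna=True) == "boolean"

print(is_boolean_column(pd.Series([True, False, None])))
print(is_boolean_column(pd.Series(["x", None])))
```

A column with no non-missing values is not reported as boolean by this check.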
Dharman
Iyar Lin