42

Some column in dataframe df, df.column, is stored as datatype int64.

The values are all 1s or 0s.

Is there a way to replace these values with boolean values?

Gonçalo Peres
user1893148

3 Answers

78
df['column_name'] = df['column_name'].astype('bool')

For example:

import pandas as pd
import numpy as np
# np.random.random_integers is deprecated; randint(0, 2) draws 0s and 1s
df = pd.DataFrame(np.random.randint(0, 2, size=5),
                  columns=['foo'])
print(df)
#    foo
# 0    0
# 1    1
# 2    0
# 3    1
# 4    1

df['foo'] = df['foo'].astype('bool')
print(df)

yields

     foo
0  False
1   True
2  False
3   True
4   True

Given a list of column_names, you could convert multiple columns to bool dtype using:

df[column_names] = df[column_names].astype(bool)

If you don't have a list of column names, but wish to convert, say, all numeric columns, then you could use

column_names = df.select_dtypes(include=[np.number]).columns
df[column_names] = df[column_names].astype(bool)
unutbu
  • 2
    How to let pandas detect this automatically? If there are only 0 and 1.. then make it boolean? – Joop May 09 '17 at 19:31
  • How to do this for all applicable columns? – jtlz2 Jan 09 '18 at 09:44
  • 2
    Tried `df['column_name'] = df['column_name'].astype('bool')`. The **boolean** value is defaulted to `True`. How to default the **boolean** as `False`? – Love Putin Not War Jun 11 '20 at 05:07
  • @user12379095 I solved the problem using a simple converter on the column, like `lambda x: x if x else 0`. Maybe not the most efficient way, but it works. – M. Hardy Dec 16 '21 at 18:05
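
One possible cause of the "defaulted to True" behaviour mentioned in the comments (an assumption, since the comment does not show its data) is missing values: NaN is truthy, so astype(bool) turns it into True. A minimal sketch on a made-up Series, showing two alternatives that treat NaN differently:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 0/1 column that also contains a missing value.
s = pd.Series([1.0, 0.0, np.nan])

plain = s.astype(bool)           # NaN is truthy, so it becomes True
strict = s.eq(1)                 # explicit comparison: NaN becomes False
nullable = s.astype('boolean')   # nullable boolean dtype: NaN becomes <NA>

print(plain.tolist())            # [True, False, True]
print(strict.tolist())           # [True, False, False]
print(nullable)
```

Which of the two alternatives is appropriate depends on whether a missing value should count as False or stay missing.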
4

There are several ways to achieve this. The options covered below are:

  1. Using pandas.Series.map

  2. Using pandas.Series.astype

  3. Using pandas.Series.replace

  4. Using pandas.Series.apply

  5. Using numpy.where

As the OP didn't specify the dataframe, this answer uses the following one:

import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 0, 1, 0], 'col2': [0, 0, 1, 0, 1], 'col3': [1, 1, 1, 0, 1], 'col4': [0, 0, 0, 0, 1]})

[Out]:

   col1  col2  col3  col4
0     1     0     1     0
1     0     0     1     0
2     0     1     1     0
3     1     0     0     0
4     0     1     1     1

Assume that only the values in col1 should be changed to boolean. To transform the whole dataframe, see the last of the notes below.

The section Time Comparison measures the execution time of each option.


Option 1

Using pandas.Series.map as follows

df['col1'] = df['col1'].map({1: True, 0: False})

[Out]:

    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1

Option 2

Using pandas.Series.astype as follows

df['col1'] = df['col1'].astype(bool)

[Out]:

    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1

Option 3

Using pandas.Series.replace, with one of the following options

# Option 3.1
df['col1'] = df['col1'].replace({1: True, 0: False})

# or

# Option 3.2
df['col1'] = df['col1'].replace([1, 0], [True, False])


[Out]:

    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1

Option 4

Using pandas.Series.apply and a custom lambda function as follows

df['col1'] = df['col1'].apply(lambda x: True if x == 1 else False)

[Out]:

    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1

Option 5

Using numpy.where as follows

import numpy as np

df['col1'] = np.where(df['col1'] == 1, True, False)

[Out]:

    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1

Time Comparison

In this specific case, time.perf_counter() was used to measure the execution time of each option.

       method               time (s)
0    Option 1 0.00000120000913739204
1    Option 2 0.00000220000219997019
2  Option 3.1 0.00000179999915417284
3  Option 3.2 0.00000200000067707151
4    Option 4 0.00000400000135414302
5    Option 5 0.00000210000143852085
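
The answer does not show the timing code itself, so the following is only a sketch of how such a measurement could be set up with time.perf_counter(). The helper time_once and the options dictionary are hypothetical names introduced here, and a single run on five rows is very noisy, so the numbers it prints should not be expected to match the table above.

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 0, 1, 0]})

# Each option from the answer, written as a function of the column.
options = {
    'Option 1': lambda s: s.map({1: True, 0: False}),
    'Option 2': lambda s: s.astype(bool),
    'Option 3.1': lambda s: s.replace({1: True, 0: False}),
    'Option 3.2': lambda s: s.replace([1, 0], [True, False]),
    'Option 4': lambda s: s.apply(lambda x: x == 1),
    'Option 5': lambda s: np.where(s == 1, True, False),
}


def time_once(func, series):
    """Hypothetical helper: run func once, return (elapsed seconds, result)."""
    start = time.perf_counter()
    result = func(series)
    return time.perf_counter() - start, result


timings = {}
outputs = {}
for name, func in options.items():
    elapsed, result = time_once(func, df['col1'])
    timings[name] = elapsed
    # Store plain Python lists so the options can be compared by value.
    outputs[name] = np.asarray(result).tolist()

print(pd.DataFrame({'method': list(timings), 'time (s)': list(timings.values())}))
```

For more stable numbers one would normally repeat each call many times (for example with the timeit module) and use a larger dataframe.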



Notes:

  • There are strong opinions on using .apply(), so one might want to read this.

  • There are additional ways to measure the time of execution; see: How do I get time of a Python program's execution?

  • To convert the whole dataframe, one can do, for example, the following

    df = df.astype(bool)
    
    [Out]:
    
        col1   col2   col3   col4
    0   True  False   True  False
    1  False  False   True  False
    2  False   True   True  False
    3   True  False  False  False
    4  False   True   True   True
    
Gonçalo Peres
2

Reference: Stack Overflow unutbu (Jan 9 at 13:25), BrenBarn (Sep 18 2017)

I had numerical columns like age and ID which I did not want to convert to Boolean. So after identifying the numerical columns as unutbu showed, I filtered out the columns whose maximum was greater than 1.

# Code as per unutbu: select the numerical columns.
column_names = df.select_dtypes(include=[np.number]).columns

# Take the max of each numerical column (found via np.number) and store the result in a temporary variable m.
m = df[column_names].max().reset_index(name='max')

# Filter as BrenBarn showed in another post: keep the rows of m where max == 1 and store them in a temporary variable n.
n = m.loc[m['max'] == 1, 'max']

# The row indexes of n are positions in column_names, so p holds the names of the columns whose max is 1.
p = column_names[n.index]

# Final piece of the code from unutbu, applied only to the columns in p.
# Using column_names directly instead of p would turn all numerical columns into Booleans.
df[p] = df[p].astype(bool)
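
A max of 1 does not guarantee that a column holds only 0s and 1s: a column such as [0.5, 1.0, 0.0] or [-1, 0, 1] also has a maximum of 1. As a possible variation (a sketch on a made-up dataframe, not the method used above), one could test every value with isin([0, 1]) instead:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; only 'is_member' is a true 0/1 flag.
df = pd.DataFrame({
    'id': [101, 102, 103],
    'age': [34, 51, 27],
    'is_member': [1, 0, 1],
    'ratio': [0.5, 1.0, 0.0],   # max is 1, but not a flag
    'delta': [-1, 0, 1],        # max is 1, but not a flag
})

numeric = df.select_dtypes(include=[np.number])

# True for each column in which every value is exactly 0 or 1.
is_flag = numeric.isin([0, 1]).all()
flag_cols = is_flag[is_flag].index

df[flag_cols] = df[flag_cols].astype(bool)
print(df.dtypes)
```

Note that isin treats NaN as not matching, so a 0/1 column with missing values would be left unconverted by this check.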
H4dr1en
mel el