Some column in dataframe df, df.column, is stored as datatype int64.
The values are all 1s or 0s.
Is there a way to replace these values with boolean values?
df['column_name'] = df['column_name'].astype('bool')
For example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 2, size=5),  # random_integers is removed in newer NumPy
                  columns=['foo'])
print(df)
# foo
# 0 0
# 1 1
# 2 0
# 3 1
# 4 1
df['foo'] = df['foo'].astype('bool')
print(df)
yields
foo
0 False
1 True
2 False
3 True
4 True
Given a list of column_names, you could convert multiple columns to bool dtype using:
df[column_names] = df[column_names].astype(bool)
If you don't have a list of column names, but wish to convert, say, all numeric columns, then you could use
column_names = df.select_dtypes(include=[np.number]).columns
df[column_names] = df[column_names].astype(bool)
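As a quick end-to-end sketch of the select_dtypes approach (the column names here are invented for illustration), non-numeric columns are left untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two 0/1 integer columns and one string column
df = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 0, 1], 'label': ['x', 'y', 'z']})

# Pick out only the numeric columns and cast them to bool
column_names = df.select_dtypes(include=[np.number]).columns
df[column_names] = df[column_names].astype(bool)

print(df.dtypes)
```

The string column 'label' keeps its original dtype; only 'a' and 'b' become bool.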
There are various ways to achieve that; below are several options:
Using pandas.Series.map
Using pandas.Series.astype
Using pandas.Series.replace
Using pandas.Series.apply
Using numpy.where
As OP didn't specify the dataframe, in this answer I will be using the following dataframe
import pandas as pd
df = pd.DataFrame({'col1': [1, 0, 0, 1, 0], 'col2': [0, 0, 1, 0, 1], 'col3': [1, 1, 1, 0, 1], 'col4': [0, 0, 0, 0, 1]})
[Out]:
col1 col2 col3 col4
0 1 0 1 0
1 0 0 1 0
2 0 1 1 0
3 1 0 0 0
4 0 1 1 1
We will consider that one wants to change only the values in col1 to boolean. If one wants to transform the whole dataframe, see one of the notes below.
In the section Time Comparison one will measure the times of execution of each option.
Option 1
Using pandas.Series.map as follows
df['col1'] = df['col1'].map({1: True, 0: False})
[Out]:
col1 col2 col3 col4
0 True 0 1 0
1 False 0 1 0
2 False 1 1 0
3 True 0 0 0
4 False 1 1 1
Option 2
Using pandas.Series.astype as follows
df['col1'] = df['col1'].astype(bool)
[Out]:
col1 col2 col3 col4
0 True 0 1 0
1 False 0 1 0
2 False 1 1 0
3 True 0 0 0
4 False 1 1 1
Option 3
Using pandas.Series.replace, with one of the following options
# Option 3.1
df['col1'] = df['col1'].replace({1: True, 0: False})
# or
# Option 3.2
df['col1'] = df['col1'].replace([1, 0], [True, False])
[Out]:
col1 col2 col3 col4
0 True 0 1 0
1 False 0 1 0
2 False 1 1 0
3 True 0 0 0
4 False 1 1 1
Option 4
Using pandas.Series.apply and a custom lambda function as follows
df['col1'] = df['col1'].apply(lambda x: True if x == 1 else False)
[Out]:
col1 col2 col3 col4
0 True 0 1 0
1 False 0 1 0
2 False 1 1 0
3 True 0 0 0
4 False 1 1 1
Option 5
Using numpy.where as follows
import numpy as np
df['col1'] = np.where(df['col1'] == 1, True, False)
[Out]:
col1 col2 col3 col4
0 True 0 1 0
1 False 0 1 0
2 False 1 1 0
3 True 0 0 0
4 False 1 1 1
Time Comparison
For this specific case, time.perf_counter() was used to measure the execution time of each option.
method time
0 Option 1 0.00000120000913739204
1 Option 2 0.00000220000219997019
2 Option 3.1 0.00000179999915417284
3 Option 3.2 0.00000200000067707151
4 Option 4 0.00000400000135414302
5 Option 5 0.00000210000143852085
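The exact benchmarking harness is not shown above, but a minimal sketch of how such timings can be collected with time.perf_counter() might look like this (option names and the tiny test frame are assumptions for illustration):

```python
import time
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 0, 1, 0]})

# Each candidate conversion, expressed as a function of the Series
options = {
    'Option 1 (map)': lambda s: s.map({1: True, 0: False}),
    'Option 2 (astype)': lambda s: s.astype(bool),
}

timings = {}
for name, fn in options.items():
    start = time.perf_counter()
    result = fn(df['col1'])
    timings[name] = time.perf_counter() - start

for name, elapsed in timings.items():
    print(f'{name}: {elapsed:.8f}s')
```

Single runs at this scale are noisy; for more reliable numbers one would repeat each call many times (e.g. with timeit) and take the minimum.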
Notes:
There are strong opinions on using .apply(), so one might want to read this.
There are additional ways to measure the time of execution. For additional ways, read this: How do I get time of a Python program's execution?
To convert the whole dataframe, one can do, for example, the following
df = df.astype(bool)
[Out]:
col1 col2 col3 col4
0 True False True False
1 False False True False
2 False True True False
3 True False False False
4 False True True True
Reference: Stack Overflow unutbu (Jan 9 at 13:25), BrenBarn (Sep 18 2017)
I had numerical columns like age and ID which I did not want to convert to Boolean. So after identifying the numerical columns like unutbu showed us, I filtered out the columns which had a maximum more than 1.
# code as per unutbu
column_names = df.select_dtypes(include=[np.number]).columns
# Get the max of each numerical column and store it in a temporary variable m.
m = df[column_names].max().reset_index(name='max')
# Filter, as BrenBarn showed in another post, to the rows where max == 1,
# and store them in a temporary variable n.
n = m.loc[m['max'] == 1, 'max']
# The indexes of n correspond to positions in column_names, so p holds the
# names of the 0/1 columns. Using column_names directly instead of p would
# turn all my numerical columns into Booleans.
p = column_names[n.index]
df[p] = df[p].astype(bool)
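The same idea can be sketched more compactly. Note this variant uses a slightly different (and assumed) criterion: it keeps any numeric column whose values are all 0s and 1s, rather than checking the column max, so a column like [0, 0, 0] would also be converted:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'age' and 'id' should stay numeric, 'flag' is 0/1
df = pd.DataFrame({'age': [25, 30, 41], 'id': [7, 8, 9], 'flag': [1, 0, 1]})

# Among the numeric columns, keep only those containing nothing but 0s and 1s
num = df.select_dtypes(include=[np.number])
binary_cols = num.columns[num.isin([0, 1]).all()]

df[binary_cols] = df[binary_cols].astype(bool)
print(df.dtypes)
```

Here only 'flag' is converted; 'age' and 'id' keep their integer dtype.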