2

I am searching for the boolean equivalent of pandas to_numeric() I want the function to convert a column to True/False/nan if at all possible, and if not throw an error.

My motivation is that I need to auto recognize and convert boolean columns in a dataset with ~1000 columns. I can do a similar things with floats/integers using the following code:

df = df_raw.apply(pd.to_numeric, errors='ignore')
Selah
  • 7,728
  • 9
  • 48
  • 60
  • Are you wanting to convert strings that look like `'True'` and `'False'` to `True` and `False` then determine if that entire column is nothing but `True` and `False`? – piRSquared May 04 '17 at 17:52

4 Answers4

6

Since pd.to_numeric is primarily used to convert strings to numeric values, I'm going to work under the assumption that you want to convert strings of literal boolean values.

Consider the dataframe df

df = pd.DataFrame([
        ['1', None, 'True'],
        ['False', 2, True]
    ])

print(df)

       0    1     2
0      1  NaN  True
1  False  2.0  True

My Choice
This is what I'd propose. Further below, I break it down in an attempt to explain what is going on.

def try_eval2(x):
    if type(x) is str:
        try:
            x = literal_eval(x)
        except:
            x = np.nan

    if type(x) is not bool:
        x = np.nan

    return x

vals = df.values
v = vals.ravel()
a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)

       0    1     2
0    NaN  NaN  True
1  False  NaN  True

Timing
You'll notice that my proposed solution is pretty fast

%%timeit
vals = df.values
v = vals.ravel()
a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)
10000 loops, best of 3: 149 µs per loop

%timeit df.astype(str).applymap(to_boolean)
1000 loops, best of 3: 1.28 ms per loop

%timeit df.astype(str).stack().map({'True':True, 'False':False}).unstack()
1000 loops, best of 3: 1.27 ms per loop

Explanation

Step 1
Now I'll create a simple function using ast.literal_eval to convert strings to values

from ast import literal_eval

def try_eval(x):
    try:
        x = literal_eval(x)
    except:
        pass
    return x

Step 2
applymap with my new function. It's going to look the same!

d1 = df.applymap(try_eval)
print(d1)

       0    1     2
0      1  NaN  True
1  False  2.0  True

Step 3
Use where and applymap again to find where values are actually bool

d2 = d1.where(d1.applymap(type).eq(bool))
print(d2)

       0   1     2
0    NaN NaN  True
1  False NaN  True

Step 4
You can drop columns with all NaN

print(d2.dropna(1, 'all'))

       0     2
0    NaN  True
1  False  True
Community
  • 1
  • 1
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • I think I checked this later and found that they approach equivalent timing. I'll try to circle back and show it. – piRSquared May 06 '17 at 07:10
5

You need replace with where where replace to NaN all not boolean:

df = df.replace({'True':True,'False':False})
df = df.where(df.applymap(type) == bool)

Old solution (very slow):

You can astype to strings if some boolean in df, applymap with custom function and ast.literal_eval for converting:

from ast import literal_eval

def to_boolean(x):
    try:
        x = literal_eval(x)
        if type(x) == bool:
            return x
        else:
            return np.nan
    except:
        x = np.nan
    return x

print (df.astype(str).applymap(to_boolean))
#with borrowing sample from piRSquared
       0   1     2
0    NaN NaN  True
1  False NaN  True

Timings:

In [76]: %timeit (jez(df))
1 loop, best of 3: 488 ms per loop

In [77]: %timeit (jez2(df))
1 loop, best of 3: 527 ms per loop

#piRSquared fastest solution
In [78]: %timeit (pir(df))
1 loop, best of 3: 5.42 s per loop

#maxu solution
In [79]: %timeit df.astype(str).stack().map({'True':True, 'False':False}).unstack()
1 loop, best of 3: 1.88 s per loop

#jezrael ols solution
In [80]: %timeit df.astype(str).applymap(to_boolean)
1 loop, best of 3: 13.3 s per loop

Code for timings:

df = pd.DataFrame([
        ['True', False, '1', 0, None, 5.2],
        ['False', True, '0', 1, 's', np.nan]])

#[20000 rows x 60 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df = pd.concat([df]*10, axis=1).reset_index(drop=True)
df.columns = pd.RangeIndex(len(df.columns))
#print (df)

def to_boolean(x):
    try:
        x = literal_eval(x)
        if type(x) == bool:
            return x
        else:
            return np.nan
    except:
        x = np.nan
    return x


def try_eval2(x):
    if type(x) is str:
        try:
            x = literal_eval(x)
        except:
            x = np.nan

    if type(x) is not bool:
        x = np.nan

    return x

def pir(df):
    vals = df.values
    v = vals.ravel()
    a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
    df2 = pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)
    return (df2)

def jez(df):
    df = df.replace({'True':True,'False':False})
    df = df.where(df.applymap(type) == bool)
    return (df)

def jez2(df):
    df = df.replace({'True':True,'False':False})
    df = df.where(df.applymap(type).eq(bool))
    return (df)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
3

astype is a more specific version of pd.to_numeric:

df = df_raw.astype('bool') 
ayhan
  • 70,170
  • 20
  • 182
  • 203
  • This is what I was thinking... However, `bool(3.14)` is `True`... Is that what OP wanted? I'm not sure. Plus One because you rock and this is useful. – piRSquared May 04 '17 at 17:53
  • Yeah, I think I didn't read the question carefully. Thank you. :) – ayhan May 04 '17 at 17:54
3

I used @piRSquared's sample DF:

In [39]: df
Out[39]:
       0    1     2
0      1  NaN  True
1  False  2.0  True

In [40]: df.astype(str).stack().map({'True':True, 'False':False}).unstack()
Out[40]:
       0    1     2
0    NaN  NaN  True
1  False  NaN  True
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419