12

I have a CSV file. I want to read most of its values as strings, but I want to read one column as bool if a column with the given title exists.

Because the CSV file has a lot of columns, I don't want to specify the datatype for each column explicitly, like this:

data = read_csv('sample.csv', dtype={'A': str, 'B': str, ..., 'X': bool})

Is it possible to define the string type on every column but one, and read that optional column as bool at the same time?

My current solution is the following (but it's very inefficient and slow):

data = read_csv('sample.csv', dtype=str) # reads all columns as strings
if 'X' in data.columns:
    l = lambda row: True if row['X'] == 'True' else False if row['X'] == 'False' else None
    data['X'] = data.apply(l, axis=1)
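
A vectorized sketch of the same conversion (assuming the column holds exactly the strings 'True' and 'False'); Series.map works on the one column directly, which avoids the row-wise apply:

import pandas as pd

data = pd.read_csv('sample.csv', dtype=str)  # read everything as string
if 'X' in data.columns:
    # values other than 'True'/'False' become NaN (similar to the None above)
    data['X'] = data['X'].map({'True': True, 'False': False})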

UPDATE: Sample CSV:

A;B;C;X
a1;b1;c1;True
a2;b2;c2;False
a3;b3;c3;True

Or the same file can appear without the 'X' column (because the column is optional):

A;B;C
a1;b1;c1
a2;b2;c2
a3;b3;c3
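
For completeness, read_csv also takes a converters mapping that is applied per column at read time; entries for columns not present in the file are, as far as I can tell, simply ignored, which fits the optional 'X'. A sketch for the semicolon-delimited sample above (note that when a dtype and a converter target the same column, pandas warns and the converter wins):

import pandas as pd

# all columns become str, except 'X', which the converter turns into bool;
# if the file has no 'X' column, the converter entry has no effect
data = pd.read_csv('sample.csv', sep=';', dtype=str,
                   converters={'X': lambda v: v == 'True'})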
elaspog

4 Answers

2

You can first filter the columns whose names contain X with boolean indexing and then replace:

cols = df.columns[df.columns.str.contains('X')]
df[cols] = df[cols].replace({'True': True, 'False': False})

Or, if you need to filter the column X exactly:

cols = df.columns[df.columns == 'X']
df[cols] = df[cols].replace({'True': True, 'False': False})

Sample:

import pandas as pd

df = pd.DataFrame({'A':['a1','a2','a3'],
                   'B':['b1','b2','b3'],
                   'C':['c1','c2','c3'],
                   'X':['True','False','True']})

print (df)
    A   B   C      X
0  a1  b1  c1   True
1  a2  b2  c2  False
2  a3  b3  c3   True
print (df.dtypes)
A    object
B    object
C    object
X    object
dtype: object

cols = df.columns[df.columns.str.contains('X')]
print (cols)

Index(['X'], dtype='object')

df[cols] = df[cols].replace({'True': True, 'False': False})

print (df.dtypes)
A    object
B    object
C    object
X      bool
dtype: object
print (df)

    A   B   C      X
0  a1  b1  c1   True
1  a2  b2  c2  False
2  a3  b3  c3   True
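
Tied back to the original semicolon-delimited file, the whole flow is short (a sketch; the boolean indexing makes the replace a no-op when 'X' is absent):

import pandas as pd

df = pd.read_csv('sample.csv', sep=';', dtype=str)  # everything as string
cols = df.columns[df.columns == 'X']  # empty Index if 'X' is missing
df[cols] = df[cols].replace({'True': True, 'False': False})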
jezrael
1

Why not use the built-in bool() type? bool() evaluates to True if a parameter is passed and the parameter is not False, None, '', or 0:

if 'X' in data.columns:
    # map 'False' to an empty string so bool() returns False for it;
    # any other non-empty string ('True') evaluates to True
    data['X'] = data['X'].replace('False', '').map(bool)
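
The replace step is the crux here: bool() on any non-empty string is True, so 'False' has to be mapped to something falsy first. A quick check:

print(bool('False'))  # True  -- non-empty string!
print(bool(''))       # False -- empty string
print(bool('True'))   # True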
TheLazyScripter
  • I think applying the lambda expression slows down processing time. – elaspog May 29 '16 at 23:43
  • 1
    Yes most definitely. If I were going to approach it the way you did originally then I would use an if/else statement. `if data == 'False': l = False` `elif data == 'True': l = True` `else: l = None` but I would use on of the answers that were provided. Assuming you have a large dataset I would time both solutions and use the fastest. Good luck – TheLazyScripter May 29 '16 at 23:48
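
A runnable sketch of that if/else suggestion (the helper name to_bool is made up), applied per value with Series.map instead of a row-wise apply:

def to_bool(value):
    # mirror the original lambda: anything other than 'True'/'False' becomes None
    if value == 'True':
        return True
    elif value == 'False':
        return False
    return None

if 'X' in data.columns:
    data['X'] = data['X'].map(to_bool)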
0

Actually, you don't need any special handling when using read_csv from pandas (tested on version 0.17). Using your example file with the X column:

import pandas as pd

df = pd.read_csv("file.csv", delimiter=";")
print(df.dtypes)

A    object
B    object
C    object
X      bool
dtype: object
gbakie
  • I wouldn't need it by default, but my data is very special (it mixes alphanumeric values). That's one of the reasons why I have this limitation on some columns: to keep numeric-looking values in string format. – elaspog May 29 '16 at 23:59
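
To illustrate that limitation: under plain type inference, numeric-looking codes lose their leading zeros, which is why dtype=str is needed for those columns, while 'X' can still be inferred as bool. A self-contained sketch with made-up data:

import io
import pandas as pd

csv = 'A;X\n0123;True\n0456;False\n'  # 'A' holds codes, not numbers

inferred = pd.read_csv(io.StringIO(csv), sep=';')
print(inferred['A'].tolist())  # [123, 456] -- leading zeros lost

as_str = pd.read_csv(io.StringIO(csv), sep=';', dtype={'A': str})
print(as_str['A'].tolist())    # ['0123', '0456'] -- preserved
print(as_str['X'].dtype)       # bool -- still inferred automatically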
0

For those looking for an answer to the question in the title (in this case, reading every column as string except the index, which is read as int), you can do something like this if you know the number of columns you have:

import numpy as np

# column 0 (the index) is read as int16, the remaining 8 columns as str
dtype = dict(zip(range(9), [np.int16] + [str] * 8))
dframe = pd.read_csv('../files/file.csv', dtype=dtype)
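
If the header names are known rather than just the column count, the same mapping can be keyed by label (the names below are hypothetical):

import numpy as np
import pandas as pd

names = ['idx', 'col1', 'col2']        # hypothetical header names
dtype = {name: str for name in names}  # everything as str...
dtype['idx'] = np.int16                # ...except the index column
dframe = pd.read_csv('../files/file.csv', dtype=dtype)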

Credit to Anton vBR in this question.

Devyzr