12

I have a CSV file. I want to read most of its values as strings, but I want to read one column as bool if a column with the given title exists.

Because the CSV file has a lot of columns, I don't want to specify the datatype for each column explicitly, like this:

data = read_csv('sample.csv', dtype={'A': str, 'B': str, ..., 'X': bool})

Is it possible to define the string type on every column but one, and read that optional column as bool at the same time?

My current solution is the following (but it's very inefficient and slow):

data = read_csv('sample.csv', dtype=str) # reads all columns as strings
if 'X' in data.columns:
    l = lambda row: True if row['X'] == 'True' else False if row['X'] == 'False' else None
    data['X'] = data.apply(l, axis=1)
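
A vectorized sketch of the same conversion (assuming the column holds exactly the strings 'True' and 'False'); Series.map works on the one column directly, which avoids the row-wise apply:

import pandas as pd

data = pd.read_csv('sample.csv', dtype=str)  # read everything as string
if 'X' in data.columns:
    # values other than 'True'/'False' become NaN (similar to the None above)
    data['X'] = data['X'].map({'True': True, 'False': False})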

UPDATE: Sample CSV:

A;B;C;X
a1;b1;c1;True
a2;b2;c2;False
a3;b3;c3;True

Or the same file can appear without the 'X' column (because the column is optional):

A;B;C
a1;b1;c1
a2;b2;c2
a3;b3;c3
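
For completeness, read_csv also takes a converters mapping that is applied per column at read time; entries for columns not present in the file are, as far as I can tell, simply ignored, which fits the optional 'X'. A sketch for the semicolon-delimited sample above (note that when a dtype and a converter target the same column, pandas warns and the converter wins):

import pandas as pd

# all columns become str, except 'X', which the converter turns into bool;
# if the file has no 'X' column, the converter entry has no effect
data = pd.read_csv('sample.csv', sep=';', dtype=str,
                   converters={'X': lambda v: v == 'True'})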
elaspog

4 Answers

2

You can first filter the columns whose names contain X with boolean indexing and then replace:

cols = df.columns[df.columns.str.contains('X')]
df[cols] = df[cols].replace({'True': True, 'False': False})

Or, if you need to filter the column X exactly:

cols = df.columns[df.columns == 'X']
df[cols] = df[cols].replace({'True': True, 'False': False})

Sample:

import pandas as pd

df = pd.DataFrame({'A':['a1','a2','a3'],
                   'B':['b1','b2','b3'],
                   'C':['c1','c2','c3'],
                   'X':['True','False','True']})

print (df)
    A   B   C      X
0  a1  b1  c1   True
1  a2  b2  c2  False
2  a3  b3  c3   True
print (df.dtypes)
A    object
B    object
C    object
X    object
dtype: object

cols = df.columns[df.columns.str.contains('X')]
print (cols)

Index(['X'], dtype='object')

df[cols] = df[cols].replace({'True': True, 'False': False})

print (df.dtypes)
A    object
B    object
C    object
X      bool
dtype: object
print (df)

    A   B   C      X
0  a1  b1  c1   True
1  a2  b2  c2  False
2  a3  b3  c3   True
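
Tied back to the original semicolon-delimited file, the whole flow is short (a sketch; the boolean indexing makes the replace a no-op when 'X' is absent):

import pandas as pd

df = pd.read_csv('sample.csv', sep=';', dtype=str)  # everything as string
cols = df.columns[df.columns == 'X']  # empty Index if 'X' is missing
df[cols] = df[cols].replace({'True': True, 'False': False})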
jezrael
1

Why not use the built-in bool() type? bool() evaluates to True if a parameter is passed and the parameter is not False, None, '', or 0:

if 'X' in data.columns:
    # map 'False' to an empty string so bool() returns False for it;
    # any other non-empty string ('True') evaluates to True
    data['X'] = data['X'].replace('False', '').map(bool)
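
The replace step is the crux here: bool() on any non-empty string is True, so 'False' has to be mapped to something falsy first. A quick check:

print(bool('False'))  # True  -- non-empty string!
print(bool(''))       # False -- empty string
print(bool('True'))   # True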
TheLazyScripter
  • I think applying the lambda expression slows down processing time. – elaspog May 29 '16 at 23:43
  • 1
    Yes most definitely. If I were going to approach it the way you did originally then I would use an if/else statement. `if data == 'False': l = False` `elif data == 'True': l = True` `else: l = None` but I would use on of the answers that were provided. Assuming you have a large dataset I would time both solutions and use the fastest. Good luck – TheLazyScripter May 29 '16 at 23:48
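
A runnable sketch of that if/else suggestion (the helper name to_bool is made up), applied per value with Series.map instead of a row-wise apply:

def to_bool(value):
    # mirror the original lambda: anything other than 'True'/'False' becomes None
    if value == 'True':
        return True
    elif value == 'False':
        return False
    return None

if 'X' in data.columns:
    data['X'] = data['X'].map(to_bool)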
0

Actually, you don't need any special handling when using read_csv from pandas (tested on version 0.17). Using your example file with the X column:

import pandas as pd

df = pd.read_csv("file.csv", delimiter=";")
print(df.dtypes)

A    object
B    object
C    object
X      bool
dtype: object
gbakie
  • I wouldn't need it by default, but my data is very special (it mixes alphanumeric values). That's one of the reasons why I have this limitation on some columns: to keep numeric-looking values in string format. – elaspog May 29 '16 at 23:59
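
To illustrate that limitation: under plain type inference, numeric-looking codes lose their leading zeros, which is why dtype=str is needed for those columns, while 'X' can still be inferred as bool. A self-contained sketch with made-up data:

import io
import pandas as pd

csv = 'A;X\n0123;True\n0456;False\n'  # 'A' holds codes, not numbers

inferred = pd.read_csv(io.StringIO(csv), sep=';')
print(inferred['A'].tolist())  # [123, 456] -- leading zeros lost

as_str = pd.read_csv(io.StringIO(csv), sep=';', dtype={'A': str})
print(as_str['A'].tolist())    # ['0123', '0456'] -- preserved
print(as_str['X'].dtype)       # bool -- still inferred automatically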
0

For those looking for an answer to the question in the title (in this case, reading every column as string except the index, which is read as int), you can do something like this if you know the number of columns you have:

import numpy as np

# column 0 (the index) is read as int16, the remaining 8 columns as str
dtype = dict(zip(range(9), [np.int16] + [str] * 8))
dframe = pd.read_csv('../files/file.csv', dtype=dtype)
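
If the header names are known rather than just the column count, the same mapping can be keyed by label (the names below are hypothetical):

import numpy as np
import pandas as pd

names = ['idx', 'col1', 'col2']        # hypothetical header names
dtype = {name: str for name in names}  # everything as str...
dtype['idx'] = np.int16                # ...except the index column
dframe = pd.read_csv('../files/file.csv', dtype=dtype)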

Credit to Anton vBR in this question.

Devyzr