
I want to import data from a CSV file with pandas.read_csv(). One of my columns contains strings wrapped in double quotes (the strings are numbers that indicate categories). I found that pandas fails to infer these strings as the "object" type; it infers them as int64 instead. See the example below:

a.csv:

uid, f_1, f_2
1,   "1", 1.1
2,   "2", 2.3
3,   "0", 4.8

pandas.read_csv('a.csv').dtypes gives the following output:

uid:int64
f_1:int64
f_2:float64

The type of f_1 was inferred as 'int64' rather than 'object'.

However, if I replace every " in a.csv with ', then f_1 is correctly inferred as 'object'. How can I prevent the wrong inference without modifying a.csv? A second question: why does pandas infer strings as the 'object' type rather than the 'str' type?

Qin Zhou
  • For the latter: numpy doesn't have a variable-length str type, only fixed-length string types. Since columns usually contain strings of mixed lengths, pandas uses object for strings for efficiency of data storage (a short illustration follows below the comments). The former might be a bug, please post it on github: https://www.github.com/pydata/pandas/issues – Andy Hayden Dec 20 '15 at 07:29
  • Welcome to Stack Overflow. You can check the [tour](http://stackoverflow.com/tour) to see how the site works. – jezrael Dec 20 '15 at 13:29
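To illustrate the point in the comment above about numpy's fixed-length strings versus pandas' object dtype, here is a small sketch (not from the original post; output reflects the numpy/pandas behaviour of that era, newer versions may differ):

import numpy as np
import pandas as pd

# numpy stores strings in fixed-width arrays, so the dtype encodes the maximum length
arr = np.array(['1', '22', '333'])
print(arr.dtype)   # <U3  (unicode strings of length up to 3)

# pandas keeps variable-length strings as generic Python objects instead
s = pd.Series(['1', '22', '333'])
print(s.dtype)     # object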

2 Answers


I think you need to add the parameter skipinitialspace in read_csv:

skipinitialspace : boolean, default False, Skip spaces after delimiter

Test:

import pandas as pd
import numpy as np
import io


temp=u"""uid, f_1, f_2
1,  "1", 1.19
2,  "2", 2.3
3,  "0", 4.8"""

print(pd.read_csv(io.StringIO(temp)))
   uid    f_1   f_2
0    1    "1"  1.19
1    2    "2"  2.30
2    3    "0"  4.80

# specifying dtype does not work here
print(pd.read_csv(io.StringIO(temp), dtype={'f_1': np.int64}).dtypes)
uid       int64
 f_1     object
 f_2    float64
dtype: object

print(pd.read_csv(io.StringIO(temp), skipinitialspace=True).dtypes)
uid      int64
f_1      int64
f_2    float64
dtype: object

If you want to remove the leading and trailing " characters from column f_1, use converters:

import pandas as pd
import io


temp=u"""uid, f_1, f_2
1,  "1", 1.19
2,  "2", 2.3
3,  "0", 4.8"""

print(pd.read_csv(io.StringIO(temp)))
   uid    f_1   f_2
0    1    "1"  1.19
1    2    "2"  2.30
2    3    "0"  4.80

# strip the surrounding " characters from a value
def converter(x):
    return x.strip('"')

# assign a converter per column
converters = {'f_1': converter}

df = pd.read_csv(io.StringIO(temp), skipinitialspace=True, converters=converters)
print(df)
   uid f_1   f_2
0    1   1  1.19
1    2   2  2.30
2    3   0  4.80
print(df.dtypes)
uid      int64
f_1     object
f_2    float64
dtype: object

If you need to convert the integer column f_1 to string, use dtype:

import pandas as pd
import io


temp=u"""uid, f_1, f_2
1,  1, 1.19
2,  2, 2.3
3,  0, 4.8"""

print(pd.read_csv(io.StringIO(temp)).dtypes)
uid       int64
 f_1      int64
 f_2    float64
dtype: object

df = pd.read_csv(io.StringIO(temp), skipinitialspace=True, dtype={'f_1': str})

print(df)
   uid f_1   f_2
0    1   1  1.19
1    2   2  2.30
2    3   0  4.80
print(df.dtypes)
uid      int64
f_1     object
f_2    float64
dtype: object

Note: don't forget to change io.StringIO(temp) to 'a.csv'.
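Putting it together for the original file, a minimal sketch (assuming a.csv is in the working directory and f_1 should stay a string column):

import pandas as pd

# skipinitialspace handles the spaces after the commas,
# and dtype keeps f_1 as strings instead of letting pandas infer int64
df = pd.read_csv('a.csv', skipinitialspace=True, dtype={'f_1': str})
print(df.dtypes)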

And an explanation of str vs object is here.

jezrael
  • Thanks for your work! I found it very strange that when I use your tiny samples it works, but when pandas loads a larger file (in my case, 200MB or more), it still fails even with `skipinitialspace=True`. But `dtype=...` works, so anyway, thanks for your suggestions. – Qin Zhou Dec 21 '15 at 02:59

You can override type inference in the read_csv call by passing a type name or a dictionary of column -> type to the optional dtype parameter; see the pandas documentation on read_csv.
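For example, a minimal sketch (assuming the file from the question is named a.csv and f_1 should be read as strings):

import pandas as pd

# map the column name to the type it should be read as,
# overriding pandas' automatic inference for f_1;
# if the header has spaces after the commas, also pass skipinitialspace=True
# so the parsed column name matches 'f_1'
df = pd.read_csv('a.csv', dtype={'f_1': str})
print(df.dtypes)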

alex314159