0

I have a .xlsfile that looks like this

col_a       col_b   col_c   col_d
5376594                     hello
12028432                    world
17735732    hello   12      hello
17736843    world           world

when I read the file with

test = pandas.read_excel('F:/test.xls')

The table is read with the following column types:

>>> test.dtypes
col_a       int64
col_b       object
col_c       float64
col_d       object

The problem I have is that I would like to have string columns for col_b and col_d. Since I'm quite new at python, can you please point me to

  1. what is happening behind the scenes? and
  2. Is there any parameter to adjust to read the column as string?

EDIT: Types for the first row as asked in comment

>>> type(test.iloc[0]['col_a'])
<class 'numpy.int64'>
>>> type(test.iloc[0]['col_b'])
<class 'float'>
>>> type(test.iloc[0]['col_c'])
<class 'numpy.float64'>
>>> type(test.iloc[0]['col_d'])
<class 'str'>
Michele
  • 8,563
  • 6
  • 45
  • 72
  • It's probably still a `str` what does `type(test.iloc[0]['col_a'])` show? – EdChum Sep 08 '15 at 13:36
  • @EdChum please see edit – Michele Sep 08 '15 at 13:41
  • The reason `col_b` is `float` is because you have missing values and `NaN` values require float dtype to represent them – EdChum Sep 08 '15 at 13:42
  • 1
    So basically the `dtypes` are correct and appropriate for columns b and d you have mixed dtypes due to missing values – EdChum Sep 08 '15 at 13:55
  • @EdChum thanks. Is there a way to 'force' pandas to read as string? – Michele Sep 08 '15 at 14:01
  • `object` will be the displayed dtype here irrespective of whether you have missing values or not, this means python object for ints, floats, datetimes and bool will be displayed as dtype, object for all others this is correct behaviour – EdChum Sep 08 '15 at 14:04

2 Answers2

1

You can defined dtype in pandas.read_csv.

dtype: A data type name or a dict of column name to data type. If not specified, data types will be inferred. (Unsupported with engine='python')

And why NaN is float - here.
Types of dtypes are here (in the end of page).

Test:

import pandas
import io
import numpy

col_types = {"col_a": numpy.int32, "col_b": str, "col_c": str, "col_d": str}

temp=u"""col_a,col_b,col_c,col_d
5376594,,,hello
12028432,,,world
17735732,hello,12,hello
17736843,world,,world"""

test = pandas.read_csv(io.StringIO(temp), header=0, sep=",", dtype=col_types)



print type(test.iloc[0]['col_a'])
print type(test.iloc[0]['col_b'])
print type(test.iloc[0]['col_c'])
print type(test.iloc[0]['col_d'])
#
#<type 'numpy.int32'>
#<type 'float'>
#<type 'float'>
#<type 'str'>

print type(test.iloc[2]['col_a'])
print type(test.iloc[2]['col_b'])
print type(test.iloc[2]['col_c'])
print type(test.iloc[2]['col_d']).
#
#<type 'numpy.int32'>
#<type 'str'>
#<type 'str'>
#<type 'str'>

print test
print test.dtypes
#
#col_a     int32
#col_b    object
#col_c    object
#col_d    object
#dtype: object
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

From the pandas documentation on pd.read_excel about the function's arguments:

dtypeType name or dict of column -> type, default None Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

You can also use the converters, as suggested in other answers, but I think you do'nt really need to, because converters apply some kind of conversion, as is shown in the documentation:

convertersdict, default None Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.

So using converters would change for example a cell that is int and has a value of 1 to a cell that is float and has a value of 1.1

Object is a general data type, usually used for strings, although pandas has a more specific dtype for strings StringDtype. See the documentation for more info.

In the end, you can change your read_excel in the following way,

test = pandas.read_excel('F:/test.xls',
    dtype={'col_a': int, 'col_b': str,'col_c': float,'col_d': str,})

and that should work. Although it might leave col_b and col_d as object type, if that happens, you should try:

test = pandas.read_excel('F:/test.xls',
    dtype={'col_a': int, 'col_b': pd.StringDtype,'col_c': float,'col_d': pd.StringDtype,})

And that should return the str format you want.

Edit: Checking the StringDtype documentation, I saw this warning, so be careful

StringDtype is considered experimental. The implementation and parts of the API may change without warning.

you can also check this question for a detailed review of every data type pandas accepts