I am importing an Excel file into a pandas dataframe with the pandas.read_excel() function.

One of the columns is the primary key of the table: it's all numbers, but it's stored as text (the little green triangle in the top left of the Excel cells confirms this).

However, when I import the file into a pandas dataframe, the column gets imported as a float. This means that, for example, '0614' becomes 614.

Is there a way to specify the datatype when importing a column? I understand this is possible when importing CSV files but couldn't find anything in the syntax of read_excel().

The only solution I can think of is to add an arbitrary letter at the beginning of the text (converting '0614' into 'A0614') in Excel to make sure the column is imported as text, and then to chop off the 'A' in Python, so I can match it to the other tables I am importing from SQL.
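
For reference, the chop-off step of that workaround would be a one-liner; a minimal sketch, assuming the column is named `key` (the column and file names are placeholders):

import pandas as pd

df = pd.read_excel('my_file.xlsx')
# Strip the artificial leading 'A' so the key matches the SQL tables again
df['key'] = df['key'].str[1:]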

– Pythonista anonymous

8 Answers


You just specify converters. I created an Excel spreadsheet with the following structure:

names   ages
bob     05
tom     4
suzy    3

Where the "ages" column is formatted as strings. To load:

import pandas as pd

# `sheet_name` replaces the deprecated `sheetname` keyword in newer pandas versions
df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', header=0,
                   converters={'names': str, 'ages': str})
>>> df
  names ages
0   bob   05
1   tom    4
2  suzy    3
– tnknepp
  • I had understood converters specified a function to apply to the column. Evidently I was wrong - thank you for pointing this out, it's very useful! – Pythonista anonymous Sep 15 '15 at 17:09
  • Where can I find the list of allowable converter functions? I see `str` here, but presumably there's `int` and a few more besides - is there a link anywhere to the source docs that enumerates the possible converter functions available? – Thomas Kimber Nov 01 '16 at 17:23
  • I have not found a list either. Since converters accepts functions, your imagination is the limit, as long as you stay within the bounds of the converters functionality (i.e. it was designed for functions that take exactly one input argument!); see the sketch after this thread. – tnknepp Nov 01 '16 at 18:46
  • Oddly, when I set a column name to `str` in the `converters` dict and then print `df.dtypes`, the type for that column is set to `object`, not `str`. Any ideas? Is it even important? – mhyousefi Sep 22 '18 at 18:35
  • @mhyousefi This is not important (on the surface at least). When setting column types as strings, Pandas refers to them as objects. See HYRY's answer [here](https://stackoverflow.com/questions/21018654/strings-in-a-dataframe-but-dtype-is-object). – tnknepp Sep 24 '18 at 10:04
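
To illustrate the point about converters accepting arbitrary one-argument functions, here is a minimal sketch; the zero-padding width and the column name are assumptions for demonstration:

import pandas as pd

# Any callable that takes a single cell value works as a converter,
# not just type constructors like str or int.
def pad_key(value):
    return str(value).zfill(4)  # width 4 is an arbitrary choice

df = pd.read_excel('Book1.xlsx', converters={'ages': pad_key})
# or, equivalently, with a lambda:
df = pd.read_excel('Book1.xlsx', converters={'ages': lambda v: str(v).zfill(4)})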

Starting with v0.20.0, the dtype keyword argument of read_excel() can be used to specify the data types to apply to the columns, just like the one that already existed for read_csv().

Using the converters and dtype arguments together on the same column name leads to the latter getting shadowed: the converter takes precedence (a quick check follows the comments below).


1) In order for pandas not to interpret the dtypes but rather pass the contents of the columns through as they originally appear in the file, we can set this argument to str or object so that we don't mess up our data. (One such case is leading zeros in numbers, which would otherwise be lost.)

pd.read_excel('file_name.xlsx', dtype=str)            # (or) dtype=object

2) It even supports a dict mapping, wherein the keys constitute the column names and the values their respective data types to be set; this is especially useful when you want to alter the dtype of only a subset of the columns.

import numpy as np

# Assuming the data types for the `a` and `b` columns are to be altered
pd.read_excel('file_name.xlsx', dtype={'a': np.float64, 'b': np.int32})
– Nickil Maveli
  • This should be the accepted answer, as converters seem to convert data AFTER reading it as a different type. This leads to information loss ("001" will be read as int("001") = 1 and then converted to str, but "001" != "1"). At least that is what happened in my case; correct me if I'm wrong. – Schorsch Jun 06 '19 at 08:20
  • Yes, this is the more intuitive solution for me. – Kris Stern Sep 09 '21 at 05:14
  • After trying many possibilities, this one worked for me. Thank you very much! – Peter Nov 05 '21 at 07:20
  • OMG! This is THE right answer! I had a stupid number that was 7E+19, giving me an overflow error when Python tried to put it in an SQL Integer, and no other method was able to deal with this. I needed to set `dtype=str` and that fixed the issue. – Nelson Jan 19 '23 at 07:30
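
A quick way to verify the precedence rule mentioned in the answer; the file and column names are placeholders:

import pandas as pd

# When both are given for the same column, pandas applies the converter
# and ignores the dtype for that column (a ParserWarning may be emitted).
df = pd.read_excel('file_name.xlsx',
                   dtype={'a': float},
                   converters={'a': str})
print(df['a'].dtype)  # object, i.e. the converter won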

If you don't know the number and names of the columns in the dataframe, this method can be handy:

# Read the file once just to discover the column names
column_list = list(pd.read_excel(file_name, 'Sheet1').columns)
# Then read it again, forcing every column to str
converter = {col: str for col in column_list}
df_actual = pd.read_excel(file_name, converters=converter)

where column_list is the list of your column names.

– Tango
  • Just wondering if `df = df.astype(str)` would not be better (simpler). – Petr Matuska Jun 07 '18 at 07:53
  • Why do you create a list first? It may be more efficient to use: `conv = {x: str for x in pd.read_excel(fn, sheet_name='sheet1').columns}` and then `df = pd.read_excel(fn, sheet_name='sheet1', converters=conv)` – Dylan_w May 07 '20 at 13:30

If you are able to read the Excel file correctly and only the integer values are not showing up, you can specify it like this:

df = pd.read_excel('my.xlsx', sheet_name='Sheet1', engine="openpyxl", dtype=str)

This should read your integer values in as strings and show them as such in the dataframe.

– Rajat Tyagi

The read_excel() function has a converters argument, where you can apply functions to the input in certain columns. You can use this to keep them as strings. From the documentation:

Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.

Example code:

pandas.read_excel(my_file, converters={my_str_column: str})
– Nix G-D
  • If we are not aware of the number of columns present in the sheet, is there any way to apply it to every column while reading? – Tango Jul 22 '17 at 19:27
  • Got the solution: `converters = {col: str for col in column_list}` and then `df = pd.read_excel('some_excelfile.xls', converters=converters)` – Tango Jul 22 '17 at 20:26
  • Can you do it by index, or do you need the name? E.g., I'm reading my file in without headers. – rrs Aug 24 '18 at 20:16
  • @rrs, you can just use an integer as the key instead of the column name; see the sketch after this thread. – Nix G-D Aug 25 '18 at 21:08
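
A minimal sketch of that integer-key variant for a headerless file; the file name and column positions are assumptions:

import pandas as pd

# Without a header row there are no column labels, so key the
# converters by column position instead of by name.
df = pd.read_excel('some_excelfile.xls', header=None, converters={0: str, 2: str})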

If you don't know the column names and you want to apply the str data type to all columns:

table = pd.read_excel("path_to_filename")
cols = table.columns
conv = dict(zip(cols, [str] * len(cols)))
table = pd.read_excel("path_to_filename", converters=conv)
– HazimoRa3d
  • It might also be useful to add `nrows=1` in the first `read_excel` call to avoid having to read the whole Excel table only to get the headers; see the sketch below. – Nuno André Jan 03 '20 at 04:16
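
A sketch of that cheaper two-pass read, assuming a pandas version whose read_excel supports the nrows keyword:

import pandas as pd

# First pass: read only the header plus one data row to get the column names
cols = pd.read_excel("path_to_filename", nrows=1).columns
# Second pass: read everything, forcing each column to str
conv = {col: str for col in cols}
table = pd.read_excel("path_to_filename", converters=conv)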

If your key has a fixed number of digits, you should probably store it as text rather than as numeric data. You can use the converters argument of read_excel for this, as sketched below.
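
A minimal sketch of the converter route, assuming the key column is named `key` (the file name is a placeholder):

import pandas as pd

# Keep the key column as text so leading zeros survive the import
df = pd.read_excel('my_file.xlsx', converters={'key': str})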

Or, if this does not work, just manipulate your data once it's read into your dataframe:

df['key_zfill'] = df['key'].astype(str).str.zfill(4)

  names   key key_zfill
0   abc     5      0005
1   def  4962      4962
2   ghi   300      0300
3   jkl    14      0014
4   mno    20      0020
– jpp

converters or dtype won't always help. Especially for date/time and duration values (ideally a mix of both...), post-processing is necessary. In such cases, reading the Excel file's content into a built-in type and creating the DataFrame from that can be an option.

Here's an example file. The "duration" column contains duration values in HH:MM:SS and invalid values "-".

[screenshot of the example sheet]

If the hour is less than 24, Excel formats the entry as a time, otherwise as a duration. We want dtype timedelta for the whole column in the DataFrame, but pandas messes up the import:

import pandas as pd
df = pd.read_excel("path-to-file")

df.duration
# 0               12:30:00
# 1    1900-01-01 00:30:00
# 2                      -
# 3    1900-01-01 23:55:01
# Name: duration, dtype: object

[type(i) for i in df.duration]
# [datetime.time, datetime.datetime, str, datetime.datetime]

Now we have datetime.datetime and datetime.time objects, and it's difficult to get back to durations (timedelta)! You could do it directly with a converter (see the sketch below), but that does not make it less difficult.
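
For completeness, such a converter might look like this sketch. The 1899-12-31 offset is inferred from the example output above (a 24:30:00 duration arrives as 1900-01-01 00:30:00), so treat it as an assumption about this particular file:

import datetime
import pandas as pd

def cell_to_timedelta(value):
    # times below 24h arrive as datetime.time objects
    if isinstance(value, datetime.time):
        return datetime.timedelta(hours=value.hour, minutes=value.minute,
                                  seconds=value.second)
    # durations of 24h and more arrive as datetimes counted from 1899-12-31
    if isinstance(value, datetime.datetime):
        return value - datetime.datetime(1899, 12, 31)
    return pd.NaT  # invalid entries such as "-"

df = pd.read_excel("path-to-file", converters={'duration': cell_to_timedelta})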

Here, I found it to be actually easier to use the Excel loader engine directly:

from openpyxl import load_workbook
wb = load_workbook('path-to-file')

sheet = wb['Tests'] # adjust sheet name, this is for the demo file
data = list(sheet.values) # a list of tuples, one tuple for each row
df = pd.DataFrame(data[1:], columns=data[0]) # first tuple is column names

df['duration']
# 0           12:30:00
# 1     1 day, 0:30:00
# 2                  -
# 3    1 day, 23:55:01
# Name: duration, dtype: object

[type(i) for i in df['duration']]
# [datetime.time, datetime.timedelta, str, datetime.timedelta]

So now we already have some timedelta objects! Converting the remaining entries to timedelta can be done as simply as

df['duration'] = pd.to_timedelta(df.duration.astype(str), errors='coerce')
df['duration']
# 0   0 days 12:30:00
# 1   1 days 00:30:00
# 2               NaT
# 3   1 days 23:55:01
# Name: duration, dtype: timedelta64[ns]
– FObersteiner