88

I am importing study data into a Pandas data frame using read_csv.

My subject codes are 6 numbers coding, among others, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").

When I import into Pandas, the leading zero is stripped of and the column is formatted as int64.

Is there a way to import this column unchanged maybe as a string?

I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.

TylerH
  • 20,799
  • 66
  • 75
  • 101
user1802883
  • 881
  • 1
  • 7
  • 3

7 Answers7

87

As indicated in this answer by Lev Landau, there could be a simple solution to use converters option for a certain column in read_csv function.

converters={'column_name': str}

Let's say I have csv file projects.csv like below:

project_name,project_id
Some Project,000245
Another Project,000478

As for example below code is trimming leading zeros:

from pandas import read_csv

dataframe = read_csv('projects.csv')
print dataframe

Result:

      project_name  project_id
0     Some Project         245
1  Another Project         478

Solution code example:

from pandas import read_csv

dataframe = read_csv('projects.csv', converters={'project_id': str})
print dataframe

Required result:

      project_name project_id
0     Some Project     000245
1  Another Project     000478

To have all columns as str:

pd.read_csv('sample.csv', dtype=str)

To have certain columns as str:

# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
wjandrea
  • 28,235
  • 9
  • 60
  • 81
baltasvejas
  • 1,698
  • 2
  • 17
  • 18
  • str as class in `dict_dtypes = {x : str for x in lst_str_cols}` but I think `'str'` still works – gregV Aug 29 '22 at 15:18
35

here is a shorter, robust and fully working solution:

simply define a mapping (dictionary) between variable names and desired data type:

dtype_dic= {'subject_id': str, 
            'subject_number' : 'float'}

use that mapping with pd.read_csv():

df = pd.read_csv(yourdata, dtype = dtype_dic)

et voila!

ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
16

If you have a lot of columns and you don't know which ones contain leading zeros that might be missed, or you might just need to automate your code. You can do the following:

df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the compete file

You could also do:

df = pd.read_csv("your_file.csv", dtype=str)

By doing this you will have all your columns as strings and you won't lose any leading zeros.

Erick Rodriguez
  • 208
  • 1
  • 3
  • 9
6

You Can do This , Works On all Versions of Pandas

pd.read_csv('filename.csv', dtype={'zero_column_name': object})
wjandrea
  • 28,235
  • 9
  • 60
  • 81
  • Also, If you need to do this for all columns you can do `pd.read_csv('filename.csv', dtype=object)` so you won't have to account for every column – John Franke Oct 09 '22 at 19:09
2

You can use converters to convert number to fixed width if you know the width.

For example, if the width is 5, then

data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{x:05}"})

This will do the trick. It works for pandas==0.23.0 and also read_excel.

Python3.6 or higher required.

secsilm
  • 381
  • 5
  • 16
1

As an example, consider the following my_data.txt file:

id,A
03,5
04,6

To preserve the leading zeros for the id column:

df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
   id  A
0  03  5
1  04  6
Andre Nevares
  • 711
  • 6
  • 21
bigwindlee
  • 11
  • 2
0

I don't think you can specify a column type the way you want (if there haven't been changes reciently and if the 6 digit number is not a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.

EDIT: Take a look at Wes Mckinney's blog, there might be something for you. It seems to be that there is a new parser from pandas 0.10 coming in November.

root
  • 76,608
  • 25
  • 108
  • 120
  • the features in that issue are done on the c-parser branch now and should be coming in 0.10. I just made a quick for issue #2184 and will be included in 0.9.1 coming up real soon. But yes, using dtypes should be the preferred behavior here so just keep a lookout for 0.10 in like a month or so. – Chang She Nov 06 '12 at 17:13
  • you should be able to make it work now if you upgrade to the latest on github master (i.e., using a converter) – Chang She Nov 06 '12 at 17:14
  • @ChangShe thanks, with the latest github version my converter work indeed! Looking forward to 0.10 for a cleaner solution though... – user1802883 Nov 07 '12 at 10:30
  • Wes Mckinney's blog page is 404. – MERose Jul 04 '16 at 12:14