How to keep leading zeros in a column when reading CSV with Pandas?

Question

I am importing study data into a Pandas data frame using read_csv.

My subject codes are six-digit numbers that encode, among other things, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").

When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.

Is there a way to import this column unchanged, maybe as a string?

I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.

Possible duplicate of [Pandas read\_csv dtype leading zeros](http://stackoverflow.com/questions/16929056/pandas-read-csv-dtype-leading-zeros) — firelynx, May 26 '16 at 07:30

score 87 · Answer 1 · edited Sep 09 '22 at 16:01

As indicated in this answer by Lev Landau, a simple solution is to use the converters option of read_csv for the column in question:

converters={'column_name': str}

Let's say I have csv file projects.csv like below:

project_name,project_id
Some Project,000245
Another Project,000478

For example, the code below strips the leading zeros:

from pandas import read_csv

dataframe = read_csv('projects.csv')
print(dataframe)

Result:

      project_name  project_id
0     Some Project         245
1  Another Project         478

Solution code example:

from pandas import read_csv

dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)

Required result:

      project_name project_id
0     Some Project     000245
1  Another Project     000478
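
The comparison above can be reproduced without a file on disk. This is a minimal sketch that assumes the same two rows as projects.csv, fed through io.StringIO; the variable names are only illustrative.

```python
import io
import pandas as pd

# Inline stand-in for projects.csv (same rows as in the example above).
csv_text = "project_name,project_id\nSome Project,000245\nAnother Project,000478\n"

# Default parsing infers an integer column, so the leading zeros are lost.
default_df = pd.read_csv(io.StringIO(csv_text))

# With a str converter, the raw cell text is kept unchanged.
kept_df = pd.read_csv(io.StringIO(csv_text), converters={"project_id": str})

print(default_df["project_id"].tolist())  # [245, 478]
print(kept_df["project_id"].tolist())     # ['000245', '000478']
```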

To have all columns as str:

import pandas as pd

pd.read_csv('sample.csv', dtype=str)

To have certain columns as str:

# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)

str as class in `dict_dtypes = {x : str for x in lst_str_cols}` but I think `'str'` still works — gregV, Aug 29 '22 at 15:18
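
Regarding the comment above: as far as I know, pandas accepts both the class str and the name 'str' as a dtype. A small sketch to check this, assuming a made-up CSV with the prefix and serial columns plus a hypothetical count column:

```python
import io
import pandas as pd

# Hypothetical data; 'count' is left to normal type inference.
csv_text = "prefix,serial,count\n007,0042,3\n010,0001,5\n"

# Same mapping written once with the class and once with its name.
as_class = pd.read_csv(io.StringIO(csv_text), dtype={"prefix": str, "serial": str})
as_name = pd.read_csv(io.StringIO(csv_text), dtype={"prefix": "str", "serial": "str"})

print(as_class["prefix"].tolist())
print(as_name["prefix"].tolist())
print(as_class["count"].tolist())
```

Both reads should keep the zeros in prefix and serial, while count is still parsed as a number.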

ℕʘʘḆḽḘ · Answer 2 · 2017-10-24T12:22:21.790

35

Here is a shorter, robust, and fully working solution:

Simply define a mapping (dictionary) between column names and the desired data types:

dtype_dic = {'subject_id': str,
             'subject_number': 'float'}

Use that mapping with pd.read_csv():

df = pd.read_csv(yourdata, dtype=dtype_dic)

et voila!
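
A self-contained sketch of this mapping, assuming a small inline CSV in place of yourdata (the rows are invented for illustration):

```python
import io
import pandas as pd

# Invented rows; subject_id carries a leading zero in the first record.
csv_text = "subject_id,subject_number\n010816,1\n150392,2\n"

dtype_dic = {"subject_id": str, "subject_number": "float"}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtype_dic)

print(df["subject_id"].tolist())      # ['010816', '150392']
print(df["subject_number"].tolist())  # [1.0, 2.0]
```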

edited Oct 24 '17 at 12:22

answered Apr 29 '16 at 12:25

you can also include many other datatypes, `float` and others. I believe this is the most pandasque solution – ℕʘʘḆḽḘ Nov 04 '16 at 18:23
query: in dtype_dic json, why is str without quotes but float in quotes? – Nikhil VJ Apr 06 '18 at 02:46
I had to loop through different CSVs with different columns. This function took all the column mappings and didn't error out when a column wasn't there in the table. So I was able to define all the columns (to be read as string) in all the different tables in just one `dtype_dic` and use it for all the csv's. Thanks! – Nikhil VJ Apr 06 '18 at 03:00
I believe this is the best solution as well :) – ℕʘʘḆḽḘ Apr 17 '18 at 20:28
This did not work for me (python3.6, pandas 0.22.0); I still lost my leading zeros. – SummerEla May 08 '18 at 23:15
@SummerEla what is the `dtype` of your column? are you using `read_csv` as indicated? – ℕʘʘḆḽḘ May 09 '18 at 02:52
@ℕʘʘḆḽḘ I tried casting both datatypes to strings and objects. I thought this would be simple, but it's just not working. – SummerEla May 09 '18 at 18:45
what do you mean object? you should either use str or numeric – ℕʘʘḆḽḘ May 09 '18 at 18:47
Yep, I was in pandas, not pure python.. hence object instead of string. – SummerEla Jul 25 '18 at 00:26

Erick Rodriguez · Answer 3 · 2018-12-10T23:30:41.523

If you have a lot of columns and don't know which ones contain leading zeros that might be lost, or you simply need to automate your code, you can do the following:

df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column: str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the complete file

You could also do:

df = pd.read_csv("your_file.csv", dtype=str)

By doing this you will have all your columns as strings and you won't lose any leading zeros.
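
One consequence is that genuinely numeric columns also arrive as text. A possible follow-up, sketched here with invented column names, is to read everything as strings and then convert only the columns that should be numeric with pd.to_numeric:

```python
import io
import pandas as pd

# Invented data: 'code' and 'zip' need their zeros, 'amount' is a real number.
csv_text = "code,zip,amount\n010816,02134,12.5\n150392,90210,7\n"

# Read every column as text so no leading zero is lost.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Convert back only the columns that are meant to be numeric.
df["amount"] = pd.to_numeric(df["amount"])

print(df["code"].tolist())
print(df["zip"].tolist())
print(df["amount"].tolist())
```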

score 6 · Answer 4 · edited Sep 09 '22 at 16:05

You can do this; it works on all versions of Pandas:

pd.read_csv('filename.csv', dtype={'zero_column_name': object})

edited Sep 09 '22 at 16:05

wjandrea

answered Nov 21 '19 at 06:27

Also, If you need to do this for all columns you can do `pd.read_csv('filename.csv', dtype=object)` so you won't have to account for every column – John Franke Oct 09 '22 at 19:09

score 2 · Answer 5 · answered May 01 '19 at 09:46

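If you know the width, you can use a converter to pad the values to a fixed width.

For example, if the width is 5 (read_csv passes each cell to the converter as a string, so it is converted to int before the zero-padded format is applied):

data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{int(x):05d}"})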

This will do the trick. It works for pandas==0.23.0 and also read_excel.

Python3.6 or higher required.
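
If the column has already been loaded as integers, the same fixed-width idea can be applied afterwards. This is only a sketch under two assumptions: the width is known (5 here) and the column has no missing values. The data is invented.

```python
import io
import pandas as pd

# Invented data; column1 is parsed as int, so the zeros are already gone.
csv_text = "column1,value\n00042,a\n01234,b\n"
df = pd.read_csv(io.StringIO(csv_text))

# Convert to text and left-pad with zeros up to the known width.
df["column1"] = df["column1"].astype(str).str.zfill(5)

print(df["column1"].tolist())  # ['00042', '01234']
```

This only restores the zeros when every code really has the same width; reading the column as a string in the first place avoids that assumption.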

score 1 · Answer 6 · edited Sep 12 '22 at 13:24

As an example, consider the following my_data.txt file:

id,A
03,5
04,6

To preserve the leading zeros for the id column:

df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df

   id  A
0  03  5
1  04  6
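
The same example without a file, as a sketch that feeds the rows of my_data.txt through io.StringIO. Note that the "string" alias selects pandas' dedicated string dtype, which to my knowledge requires pandas 1.0 or later.

```python
import io
import pandas as pd

# Same rows as my_data.txt above.
csv_text = "id,A\n03,5\n04,6\n"

df = pd.read_csv(io.StringIO(csv_text), dtype={"id": "string"})

print(df.dtypes)           # id uses the string dtype, A is inferred as integer
print(df["id"].tolist())   # ['03', '04']
```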

edited Sep 12 '22 at 13:24

Andre Nevares

answered Sep 05 '22 at 16:01

bigwindlee

root · Answer 7 · 2012-11-06T12:29:29.713

0

I don't think you can specify a column type the way you want (unless there have been changes recently, or the 6-digit number is a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.

EDIT: Take a look at Wes McKinney's blog, there might be something for you. It seems that a new parser is coming with pandas 0.10 in November.

edited Nov 06 '12 at 12:29

answered Nov 06 '12 at 11:53

the features in that issue are done on the c-parser branch now and should be coming in 0.10. I just made a quick fix for issue #2184 and it will be included in 0.9.1 coming up real soon. But yes, using dtypes should be the preferred behavior here so just keep a lookout for 0.10 in like a month or so. – Chang She Nov 06 '12 at 17:13
you should be able to make it work now if you upgrade to the latest on github master (i.e., using a converter) – Chang She Nov 06 '12 at 17:14
@ChangShe thanks, with the latest github version my converter works indeed! Looking forward to 0.10 for a cleaner solution though... – user1802883 Nov 07 '12 at 10:30
Wes Mckinney's blog page is 404. – MERose Jul 04 '16 at 12:14
