from_csv picks up a '04' as one of the values and converts it to a string. How do I make sure that all columns being picked up are read as strings? I would want to avoid handling individual columns, as there are 114 columns and I do not want to go through the exercise of analyzing which columns are impacted.
-
CORRECTION: and converts it to an INT – Pankaj Singh Mar 07 '17 at 15:49
-
Not really a duplicate. Loading from CSV is not the problem. The problem is when you use the from_csv method of DataFrame – Pankaj Singh Mar 07 '17 at 15:53
-
@PankajSingh you can [edit] your question to include corrections... – Jon Clements Mar 07 '17 at 15:55
-
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv `dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} (Unsupported with engine=’python’). Use str or object to preserve and not interpret dtype.` --> dtype=str – mkaran Mar 07 '17 at 16:07
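(For illustration, a minimal sketch of the two forms that comment describes; the file name and column names are placeholders:)

import numpy as np
import pandas as pd

# Hypothetical: pin only columns 'a' and 'b' to explicit dtypes, leave the rest inferred.
df = pd.read_csv('data.csv', dtype={'a': str, 'b': np.int32})

# Or skip inference entirely and keep every column as a string.
df_all_str = pd.read_csv('data.csv', dtype=str)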
-
You can just do `df = pd.read_csv(your_filepath, dtype=str)` – EdChum Mar 07 '17 at 16:07
2 Answers
If you want all columns to be str then pass dtype=str to read_csv:

df = pd.read_csv(file_path, dtype=str)

This will preserve any leading zeroes.
Example:
In [54]:
import io
import pandas as pd

t = """a,b
001,230
01,003"""
df = pd.read_csv(io.StringIO(t), dtype=str)
df
Out[54]:
     a    b
0  001  230
1   01  003
Here the dtypes will be listed as object, which is the correct dtype for str:
In [55]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
a 2 non-null object
b 2 non-null object
dtypes: object(2)
memory usage: 112.0+ bytes
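As a follow-up sketch (not from the original answer): if some of those columns do still need to be numeric later, they can be converted back individually with pd.to_numeric while everything else stays as strings; the column used here is just the one from the example above.

import io
import pandas as pd

t = """a,b
001,230
01,003"""

df = pd.read_csv(io.StringIO(t), dtype=str)

# Convert back only the columns that really should be numeric.
df['b'] = pd.to_numeric(df['b'])

print(df.dtypes)  # a -> object, b -> int64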

If you have only a limited number of columns to read as strings:

Instead of from_csv, use read_csv (see the documentation) and set dtype={'your_column_name': np.str_}.

If all the data should be considered a string:

Edit: As pointed out in the comments, the suggested solution removes leading zeroes from the data. EdChum's answer handles this case as requested.

Just convert the data after reading it with df.astype(np.str_). You can also convert a set of columns (of which you will still need the names) by putting all the names in a list and then doing df[list_of_column_names] = df[list_of_column_names].astype(np.str_)
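A minimal sketch of the post-read conversion described above (the file name and column names are placeholders); note the caveat in the edit: by the time astype runs, inferred values such as '04' have already lost their leading zeros.

import pandas as pd

# Hypothetical column names; in practice the list could also be built from df.columns.
cols_to_convert = ['account_id', 'postal_code']

df = pd.read_csv('data.csv')          # dtypes are inferred here

# Convert every column to string after the fact...
df_all_str = df.astype(str)

# ...or only the listed columns.
df[cols_to_convert] = df[cols_to_convert].astype(str)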
-
That is exactly what I wanted to avoid. I have 114 columns. The above suggestion would mean setting the datatype for 114 columns – Pankaj Singh Mar 07 '17 at 15:56
-
See the updated answer for more options to avoid specifying all column names – GPhilo Mar 07 '17 at 16:06
-
Correct me if I'm wrong, but if converted after reading, then the '04' that was converted to 4 will become '4', which means information may be lost. – mkaran Mar 07 '17 at 16:10
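To illustrate this point, a small self-contained check (the one-column CSV here is made up):

import io
import pandas as pd

t = 'a\n04\n'

# Default inference: '04' is read as the integer 4, so converting back gives '4'.
inferred = pd.read_csv(io.StringIO(t))
print(inferred['a'].astype(str).tolist())   # ['4'] -- leading zero lost

# dtype=str at read time keeps the original text.
as_str = pd.read_csv(io.StringIO(t), dtype=str)
print(as_str['a'].tolist())                 # ['04']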
-