from_csv picks up a '04' as one of the values and converts it to a string. How do I make sure that all columns being picked up are read as strings? I would want to avoid handling individual columns, as there are 114 columns and I do not want to go through the exercise of analyzing which columns are impacted.
-
CORRECTION: and converts it to an INT – Pankaj Singh Mar 07 '17 at 15:49
-
Not really a duplicate. Loading from CSV is not the problem. The problem is when you use the from_csv method of DataFrame – Pankaj Singh Mar 07 '17 at 15:53
-
@PankajSingh you can [edit] your question to include corrections... – Jon Clements Mar 07 '17 at 15:55
-
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv `dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} (Unsupported with engine=’python’). Use str or object to preserve and not interpret dtype.` --> dtype=str – mkaran Mar 07 '17 at 16:07
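(For illustration, a minimal sketch of the two forms that comment describes; the file name and column names are placeholders:)

import numpy as np
import pandas as pd

# Hypothetical: pin only columns 'a' and 'b' to explicit dtypes, leave the rest inferred.
df = pd.read_csv('data.csv', dtype={'a': str, 'b': np.int32})

# Or skip inference entirely and keep every column as a string.
df_all_str = pd.read_csv('data.csv', dtype=str)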
-
You can just do `df = pd.read_csv(your_filepath, dtype=str)` – EdChum Mar 07 '17 at 16:07
2 Answers
If you want all columns to be str then pass dtype=str to read_csv:

df = pd.read_csv(file_path, dtype=str)

This will preserve any leading zeroes.
Example:
In [54]:
import io
import pandas as pd

t = """a,b
001,230
01,003"""
df = pd.read_csv(io.StringIO(t), dtype=str)
df
Out[54]:
     a    b
0  001  230
1   01  003
Here the dtypes will be listed as object, which is the correct dtype for str:
In [55]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
a 2 non-null object
b 2 non-null object
dtypes: object(2)
memory usage: 112.0+ bytes
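As a follow-up sketch (not from the original answer): if some of those columns do still need to be numeric later, they can be converted back individually with pd.to_numeric while everything else stays as strings; the column used here is just the one from the example above.

import io
import pandas as pd

t = """a,b
001,230
01,003"""

df = pd.read_csv(io.StringIO(t), dtype=str)

# Convert back only the columns that really should be numeric.
df['b'] = pd.to_numeric(df['b'])

print(df.dtypes)  # a -> object, b -> int64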

If you have only a limited number of columns to read as strings:

Instead of from_csv, use read_csv (see the documentation) and set dtype={'your_column_name': np.str_}.

If all the data should be considered a string:

Edit: As pointed out in the comments, the suggested solution removes leading zeroes from the data. EdChum's answer handles this case as requested.

Just convert the data after reading it with df.astype(np.str_). You can also convert a set of columns (of which you will still need the names) by putting all the names in a list and then doing df[list_of_column_names] = df[list_of_column_names].astype(np.str_)
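A minimal sketch of the post-read conversion described above (the file name and column names are placeholders); note the caveat in the edit: by the time astype runs, inferred values such as '04' have already lost their leading zeros.

import pandas as pd

# Hypothetical column names; in practice the list could also be built from df.columns.
cols_to_convert = ['account_id', 'postal_code']

df = pd.read_csv('data.csv')          # dtypes are inferred here

# Convert every column to string after the fact...
df_all_str = df.astype(str)

# ...or only the listed columns.
df[cols_to_convert] = df[cols_to_convert].astype(str)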
-
That is exactly what I wanted to avoid. I have 114 columns. The above suggestion would mean setting the datatype for 114 columns – Pankaj Singh Mar 07 '17 at 15:56
-
See the updated answer for more options to avoid specifying all column names – GPhilo Mar 07 '17 at 16:06
-
Correct me if I'm wrong, but if converted after reading, then the '04' that was converted to 4 will become '4', which means information may be lost. – mkaran Mar 07 '17 at 16:10
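To illustrate this point, a small self-contained check (the one-column CSV here is made up):

import io
import pandas as pd

t = 'a\n04\n'

# Default inference: '04' is read as the integer 4, so converting back gives '4'.
inferred = pd.read_csv(io.StringIO(t))
print(inferred['a'].astype(str).tolist())   # ['4'] -- leading zero lost

# dtype=str at read time keeps the original text.
as_str = pd.read_csv(io.StringIO(t), dtype=str)
print(as_str['a'].tolist())                 # ['04']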
-