1

from_csv picks up a value such as '04' and converts it to an int. How do I make sure that all columns are read as strings? I want to avoid handling individual columns, as there are 114 of them and I do not want to go through the exercise of analyzing which columns are affected.

EdChum
Pankaj Singh
  • CORRECTION: and converts it to an INT – Pankaj Singh Mar 07 '17 at 15:49
  • Not really a duplicate. Loading from CSV is not the problem. The problem arises when you use the from_csv method of DataFrame – Pankaj Singh Mar 07 '17 at 15:53
  • @PankajSingh you can [edit] your question to include corrections... – Jon Clements Mar 07 '17 at 15:55
  • http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv `dtype : Type name or dict of column -> type, default None. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python'). Use str or object to preserve and not interpret dtype.` So use dtype=str – mkaran Mar 07 '17 at 16:07
  • You can just do `df = pd.read_csv(your_filepath, dtype=str)` – EdChum Mar 07 '17 at 16:07

2 Answers

5

If you want all columns to be str then pass dtype=str to read_csv:

df = pd.read_csv(file_path, dtype=str)

This preserves any leading zeroes.

Example:

In [54]:
import io
import pandas as pd
# Sample CSV with leading zeroes in both columns
t="""a,b
001,230
01,003"""
df = pd.read_csv(io.StringIO(t), dtype=str)
df

Out[54]:
     a    b
0  001  230
1   01  003

Here the dtypes are listed as object, which is the correct dtype for str columns:

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
a    2 non-null object
b    2 non-null object
dtypes: object(2)
memory usage: 112.0+ bytes
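
As a minimal sketch of the difference, the snippet below reads the same sample data twice, once with default type inference and once with dtype=str. The variable names (csv_text, inferred, as_text) are made up for illustration.

```python
import io
import pandas as pd

# Same sample data as above; both columns contain leading zeroes.
csv_text = "a,b\n001,230\n01,003\n"

# Default behaviour: pandas infers integer columns, so '001' becomes 1.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype=str every column is kept as text, exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["a"].tolist())
print(as_text["a"].tolist())
```

Because dtype=str applies to every column, none of the 114 column names need to be listed.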
EdChum
1

If you have only a limited number of columns to read as strings:

Instead of from_csv, use read_csv (see its documentation) and set

dtype={ 'your_column_name':np.str_ }
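
A rough sketch of the per-column form, assuming a hypothetical file with columns zip_code and count (neither name comes from the question). The built-in str is used here in place of np.str_:

```python
import io
import pandas as pd

# Hypothetical data: only zip_code needs to stay as text.
csv_text = "zip_code,count\n04,10\n007,20\n"

# Columns named in the dict are read as strings;
# all other columns still go through normal type inference.
df = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})

print(df["zip_code"].tolist())
print(df["count"].tolist())
```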

If all the data should be considered a string:

Edit: As pointed out in the comments, the suggested solution removes leading zeroes from the data. EdChum's answer handles this case as requested.

Just convert the data after reading it with df.astype(np.str_). You can also convert a subset of columns (for which you will still need the names) by putting the names in a list and then doing df[list_of_column_names] = df[list_of_column_names].astype(np.str_)
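
The caveat about converting after the read can be seen in a small sketch. The sample data and the variable names (converted, subset, list_of_column_names) are illustrative only:

```python
import io
import pandas as pd

csv_text = "a,b\n001,230\n01,003\n"

# Reading without dtype lets pandas parse the values as integers first.
df = pd.read_csv(io.StringIO(csv_text))

# Converting afterwards yields strings, but the leading zeroes are already gone.
converted = df.astype(str)
print(converted["a"].tolist())

# Converting only selected columns works the same way.
list_of_column_names = ["a"]
subset = df.copy()
subset[list_of_column_names] = subset[list_of_column_names].astype(str)
print(subset["a"].tolist())
```

This is why passing dtype=str at read time is the safer choice when the original text must be preserved.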

Community
GPhilo