Why isn't the column read with the type given in converters?

Question

I want to read a CSV file with one specified column forced to string type. The data file is located here:

Please download it and save it as $HOME\cbond.csv. (I can't upload it to Dropbox or other cloud drives because of the GFW. jianguoyun provides an English GUI; create your own free account to download my sample data file.)

import pandas as pd
df = pd.read_csv('cbond.csv',sep=',',header=0, converters={'正股代码':str})

I set the column 正股代码 in the CSV file to string type with converters, then check the data types of all columns with df.info().

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239 entries, 0 to 238
Data columns (total 17 columns):
代码       239 non-null int64
转债名称     239 non-null object
现价       239 non-null float64
涨跌幅      239 non-null float64
正股名称     239 non-null object
正股价      239 non-null float64
正股涨跌     239 non-null float64
转股价      239 non-null float64
回售触发价    239 non-null float64
强赎触发价    239 non-null float64
到期时间     239 non-null object
剩余年限     239 non-null float64
正股代码     239 non-null object
转股起始日    239 non-null object
发行规模     239 non-null float64
剩余规模     239 non-null object
转股溢价率    239 non-null float64
dtypes: float64(10), int64(1), object(6)

Why is the column 正股代码 shown as

   正股代码     239 non-null object

instead of

   正股代码     239 non-null string

?

I tried upgrading pandas:

sudo apt-get install --upgrade  python3-pandas
Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3-pandas is already the newest version (0.19.2-5.1).

I also tried different statements:

>>> import pandas as pd
>>> pd.__version__
'0.24.2'
>>> test_1  = pd.read_csv('cbond.csv',dtype={'正股代码':'string'})
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/dtypes/common.py", line 2011, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type "string" not understood

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 490, in pandas._libs.parsers.TextReader.__cinit__
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/dtypes/common.py", line 2017, in pandas_dtype
    dtype))
TypeError: data type 'string' not understood
>>> test_2  = pd.read_csv('cbond.csv',dtype={'正股代码':'str'})
>>> test_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239 entries, 0 to 238
Data columns (total 17 columns):
代码       239 non-null int64
转债名称     239 non-null object
现价       239 non-null float64
涨跌幅      239 non-null float64
正股代码     239 non-null object
正股名称     239 non-null object
正股价      239 non-null float64
正股涨跌     239 non-null float64
转股价      239 non-null float64
回售触发价    239 non-null float64
强赎触发价    239 non-null float64
到期时间     239 non-null object
剩余年限     239 non-null float64
转股起始日    239 non-null object
发行规模     239 non-null float64
剩余规模     239 non-null object
转股溢价率    239 non-null float64
dtypes: float64(10), int64(1), object(6)
memory usage: 31.8+ KB

Possible duplicate of https://stackoverflow.com/questions/22231592/pandas-change-data-type-of-series-to-string — hSin, Feb 24 '20 at 00:28
Totally different from that post; my issue focuses on the `converters` argument, not on df['fieldname'].astype(str). — showkey, Feb 24 '20 at 00:34
Specify the encoding in read_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html — oetzi, Feb 24 '20 at 18:55
@it_is_a_literature, not sure if this is what you want, but this worked for me, try it: `df = pd.read_csv("cbond.csv",sep=',',header=0, dtype={'正股代码':'string'})` Use the latest pandas; the string type did not exist in earlier versions of pandas. — PKumar, Feb 25 '20 at 04:45
Reading with `df = pd.read_csv("cbond.csv",sep=',',header=0, dtype={'正股代码':'str'})` doesn't help. — showkey, Feb 25 '20 at 06:24
@it_is_a_literature my first guess was an encoding issue. Could that be the case? Unfortunately, I am currently not able to download the file and test it. Can you please use the `read_csv` method with the `chunksize` parameter? Read the file chunk by chunk, say 10 rows at a time, and try to find the chunk where it throws an exception. That way you can deduce what causes the problem in column `'正股代码'`. — null, Feb 26 '20 at 07:54

emiljoj · Answer 1 · 2020-02-25T21:29:58.220

1

Would it help to assign the dtype of the column after reading the CSV file?

df['正股代码'] = df['正股代码'].astype('string')

In the new pandas 1.0, the string dtype is experimental. Read more here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.StringDtype.html#pandas.StringDtype

This worked for me:

import numpy as np
import pandas as pd

test_df = pd.DataFrame(data={'numbers_column': np.nan,
                             'strings_column': ['3_re', '4_re', '5_re', 'random_str']},
                       index=[1, 2, 3, 4])

## up to this point, the dtype of strings_column is still object

test_df['strings_column'] = test_df['strings_column'].astype('string')

Alternatively, to read the column as string directly when opening the file, this worked for me:

 test_2 = pd.read_csv(.....,
                dtype={'正股代码':'string'})
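
A minimal, self-contained sketch of that call, assuming pandas 1.0 or later. The in-memory CSV and the column name stock_code are made-up stand-ins for cbond.csv and 正股代码, not the asker's real data:

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for cbond.csv
csv_text = "name,stock_code\nA,000001\nB,600000\n"

# dtype='string' requests the pandas StringDtype extension type (pandas >= 1.0)
sample = pd.read_csv(io.StringIO(csv_text), dtype={'stock_code': 'string'})

print(sample['stock_code'].dtype)     # dtype reported for the column
print(sample['stock_code'].tolist())  # leading zeros are kept
```

On a pandas version older than 1.0, the same call fails with the "data type 'string' not understood" error shown in the question.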

edited Feb 25 '20 at 21:29

answered Feb 25 '20 at 21:21


Which part is not clear? Did you try reading your CSV as follows: df = pd.read_csv('cbond.csv',sep=',',header=0, dtype={'正股代码':'string'}) – emiljoj Feb 26 '20 at 08:26
The string dtype is only available from pandas 1.0, released at the end of January 2020. You cannot store a column as dtype string in earlier versions of pandas. Let me know if updating pandas helps – emiljoj Feb 26 '20 at 14:55

Sharan Arumugam · Answer 2 · 2020-02-28T08:06:24.963

1

Prior to pandas 1.0.0 (your apt package is 0.19), there is no string dtype in pandas. Text columns are stored as ordinary Python str objects inside a NumPy object array, which df.info() reports as object dtype.

https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-data-types
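
To illustrate the point, here is a small sketch showing that converters={...: str} does take effect even when the column is not labelled string: every cell comes back as a Python str. The in-memory CSV and the column name stock_code are hypothetical stand-ins for the asker's file:

```python
import io
import pandas as pd

# Hypothetical sample; stock_code stands in for the 正股代码 column
csv_text = "name,stock_code\nA,000001\nB,600000\n"

by_converter = pd.read_csv(io.StringIO(csv_text), converters={'stock_code': str})

# Each cell is a Python str, so leading zeros survive the read
print(by_converter['stock_code'].map(type).unique())
print(by_converter['stock_code'].tolist())

# The dtype pandas reports for the column (object on pre-1.0 versions)
print(by_converter['stock_code'].dtype)
```

So on old pandas versions an object column holding str values is how a string column looks.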

edited Feb 28 '20 at 08:06

answered Feb 27 '20 at 08:40


NicoNing · Accepted Answer · 2020-03-01T14:23:00.630

Use convert_dtypes, which requires pandas 1.0 or later. It converts each column to the best possible dtype among those that support pd.NA.

DOC: pandas.DataFrame.convert_dtypes

Try this:


import pandas as pd
df = pd.read_csv('cbond.csv')
dfn = df.convert_dtypes()
print(dfn.dtypes)

"""
代码         Int64
转债名称      string
现价       float64
涨跌幅      float64
正股名称      string
正股价      float64
正股涨跌     float64
转股价      float64
回售触发价    float64
强赎触发价    float64
到期时间      string
剩余年限     float64
正股代码       Int64
转股起始日     string
发行规模     float64
剩余规模      string
转股溢价率    float64
dtype: object
"""

As for why df = pd.read_csv('cbond.csv',sep=',',header=0, converters={'正股代码':str}) or df['正股代码'] = df['正股代码'].astype('string') doesn't work the way we want:

It looks like a bug to me, but pandas treats it as a feature.

Either way, convert_dtypes fixed this for me.
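
Note that in the output above 正股代码 comes out as Int64, because a plain read_csv already parsed it as an integer. If that column must be a string, one possible approach (a sketch of mine, not something tested on the real cbond.csv) is to keep it as text while reading and only then call convert_dtypes. The sample data and the column name stock_code are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical sample; stock_code stands in for the 正股代码 column
csv_text = "name,stock_code,price\nA,000001,101.5\nB,600000,99.0\n"

# Keep stock_code as text while parsing, so it is not inferred as an integer
raw = pd.read_csv(io.StringIO(csv_text), dtype={'stock_code': str})

# convert_dtypes (pandas >= 1.0) then moves text columns to the string dtype
converted = raw.convert_dtypes()
print(converted.dtypes)
```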
