Pandas (>=1.3.x) read_csv UnicodeDecodeError: 'utf-8', but it worked OK with Pandas (<=1.2.5)

Question

Until now I was working with pandas 1.2.2. After upgrading to 1.3.1, I get the following error when I read a CSV file. I didn't have any problem before the upgrade.

Here is the encoding reported for the file:

>>> with open('file.csv') as f:
...     print(f)
... 
<_io.TextIOWrapper name='file.csv' mode='r' encoding='UTF-8'>

With pandas (<=1.2.5) the file was read without problems. Here is an example:

Python 3.9.2 (default, Feb 19 2021, 17:23:45) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 7c48ff4409c622c582c56a5702373f726de08e96
python           : 3.9.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.10.25-linuxkit
Version          : #1 SMP Tue Mar 23 09:27:39 UTC 2021
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.2.5
numpy            : 1.21.1
pytz             : 2021.1
dateutil         : 2.8.2
pip              : 21.2.1
setuptools       : 53.0.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : 1.4.4
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.1
IPython          : 7.25.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : None
pyxlsb           : 1.0.8
s3fs             : None
scipy            : 1.7.0
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
>>> gmast = pd.read_csv('file.csv', decimal= ",", sep=';')

>>> gmast.shape
(191502, 6)

With pandas (>=1.3.x):

Python 3.9.2 (default, Feb 19 2021, 17:23:45) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : c7f7443c1bad8262358114d5e88cd9c8a308e8aa
python           : 3.9.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.10.25-linuxkit
Version          : #1 SMP Tue Mar 23 09:27:39 UTC 2021
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.1
numpy            : 1.21.1
pytz             : 2021.1
dateutil         : 2.8.2
pip              : 21.2.1
setuptools       : 53.0.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : 1.4.4
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.1
IPython          : 7.25.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : None
pyxlsb           : 1.0.8
s3fs             : None
scipy            : 1.7.0
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

>>> gmast = pd.read_csv('file.csv', decimal= ",", sep=';')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/aaa/.local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/aaa/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/aaa/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/home/aaa/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/aaa/.local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1917, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 205520: invalid start byte

If I use the encoding cp1252, it works:

gmast = pd.read_csv('file.csv', decimal= ",", sep=';', encoding='cp1252')

>>> gmast.shape
(191502, 6)

I don't understand it. Does anyone have a similar issue? Thanks in advance.

score 2 · Accepted Answer · answered Aug 02 '21 at 08:33

According to the exception and the pandas versions involved, the problem is likely that your file contains bytes that are not valid UTF-8, which were suppressed before v1.3. See this bug report comment.

Also, pandas introduced the encoding_errors parameter (encoding_errors: str, optional, default “strict”) in version 1.3 to handle encoding errors explicitly. So you should check your file for invalid characters.
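
One way to check the file is to scan its raw bytes for positions where UTF-8 decoding fails. The following is only a sketch using the standard library; the helper name find_invalid_utf8 and the sample bytes are made up for illustration. For the real file you would pass the result of open('file.csv', 'rb').read() instead of the sample.

```python
def find_invalid_utf8(data: bytes, limit: int = 10):
    """Return up to `limit` (offset, byte_value) pairs where UTF-8 decoding fails."""
    problems = []
    pos = 0
    while pos < len(data) and len(problems) < limit:
        try:
            # Decode the remainder; success means no further invalid bytes.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            bad = pos + exc.start          # absolute offset of the offending byte
            problems.append((bad, data[bad]))
            pos = bad + 1                  # resume scanning just after it
    return problems


# Hypothetical sample: valid UTF-8 text with one stray 0x96 byte in it.
sample = "a;b\n1,5;x ".encode("utf-8") + b"\x96" + b" y\n"
problems = find_invalid_utf8(sample)
for offset, value in problems:
    print(f"offset {offset}: byte 0x{value:02x}")
```

The offsets reported this way are positions in the raw file, so they can be used to inspect the surrounding bytes and decide which encoding the file was really written in.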

In any case, if you want the behavior from before v1.3, you can use 'replace' (or 'ignore' if that suits your case better):

gmast = pd.read_csv('file.csv', decimal= ",", sep=';', encoding_errors='replace')
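
To see what each choice does to the offending byte, here is a small standard-library sketch. It assumes that the encoding_errors values behave like Python's codec error handlers of the same names, and it uses a made-up byte string containing the 0x96 byte from the traceback.

```python
# Hypothetical field content containing the byte reported in the traceback.
raw = b"10 \x96 20"

# Default ("strict") handling raises, as pandas >= 1.3 does.
strict_failed = False
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf-8", errors="replace")

# "ignore" silently drops the bad byte.
ignored = raw.decode("utf-8", errors="ignore")

# Decoding as cp1252 keeps the character: 0x96 is an en dash (U+2013) there.
as_cp1252 = raw.decode("cp1252")

print(strict_failed, repr(replaced), repr(ignored), repr(as_cp1252))
```

Both 'replace' and 'ignore' lose the original character. If the file was in fact written as cp1252, which the successful read with encoding='cp1252' in the question suggests, passing that encoding keeps the data intact.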
