Until now I was working with pandas 1.2.2. After upgrading to 1.3.1, I get the following error when I read a CSV file. I had no problems before the upgrade.
Here is the encoding Python reports when opening the file:
>>> with open('file.csv') as f:
...     print(f)
...
<_io.TextIOWrapper name='file.csv' mode='r' encoding='UTF-8'>
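Note that this output only shows the encoding Python chose when opening the file (the locale default when no `encoding` is passed), not how the bytes in the file are actually encoded. A minimal sketch for locating the first byte that is not valid UTF-8, assuming the file fits in memory; the helper name `first_invalid_utf8` is made up for illustration:

```python
def first_invalid_utf8(path, context=20):
    """Return (offset, byte value, surrounding bytes) for the first byte
    that cannot be decoded as UTF-8, or None if the whole file decodes."""
    # Read raw bytes so that no decoding happens implicitly.
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte.
        start = max(exc.start - context, 0)
        return exc.start, data[exc.start], data[start:exc.start + context]
    return None
```

Calling `first_invalid_utf8('file.csv')` should point at the same position as the traceback (205520 in my case) and show the bytes around it.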
With pandas (<= 1.2.5) the file was read without problems. For example:
Python 3.9.2 (default, Feb 19 2021, 17:23:45)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 7c48ff4409c622c582c56a5702373f726de08e96
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.25-linuxkit
Version : #1 SMP Tue Mar 23 09:27:39 UTC 2021
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.5
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.1
setuptools : 53.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.4.4
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : 1.0.8
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
>>> gmast = pd.read_csv('file.csv', decimal= ",", sep=';')
>>> gmast.shape
(191502, 6)
With pandas (>= 1.3.x):
Python 3.9.2 (default, Feb 19 2021, 17:23:45)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit : c7f7443c1bad8262358114d5e88cd9c8a308e8aa
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.25-linuxkit
Version : #1 SMP Tue Mar 23 09:27:39 UTC 2021
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.1
setuptools : 53.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.4.4
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : 1.0.8
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
>>> gmast = pd.read_csv('file.csv', decimal= ",", sep=';')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/aaa/.local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/aaa/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/aaa/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/home/aaa/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/aaa/.local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1917, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 205520: invalid start byte
If I pass the encoding cp1252 explicitly, it works:
>>> gmast = pd.read_csv('file.csv', decimal= ",", sep=';', encoding='cp1252')
>>> gmast.shape
(191502, 6)
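For what it's worth, the byte from the error message can be checked on its own. The sketch below uses a made-up byte string (not the real content of my file) to show that 0x96 is not a valid UTF-8 start byte but maps to an en dash (U+2013) in cp1252, which would explain why that encoding succeeds:

```python
# Hypothetical sample bytes; only the 0x96 byte matches the error message.
raw = b"2019 \x96 2020"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0x96 is a continuation byte, so it cannot start a UTF-8 sequence.
    utf8_ok = False

# cp1252 assigns 0x96 to U+2013 (en dash), so decoding succeeds.
decoded = raw.decode("cp1252")

print(utf8_ok)        # False
print(repr(decoded))  # '2019 \u2013 2020' (shown with a literal en dash)
```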
I don't understand why this changed. Has anyone had a similar issue? Thanks in advance.