
I have a small script that uses pandas to read and print a .csv file generated from MS Excel.

import pandas as pd
data = pd.read_csv('./2010-11.csv')
print(data)

This script runs in Python 2.7.8, but in Python 3.4.1 it gives the following error. Any ideas why this might be so? Thanks in advance for any help with this.

Traceback (most recent call last):
  File "proc_csv_0-0.py", line 3, in <module>
    data = pd.read_csv('./2010-11.csv')
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 260, in _read
    return parser.read()
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 721, in read
    ret = self._engine.read(nrows)
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 1170, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7566)
  File "pandas/parser.pyx", line 791, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7806)
  File "pandas/parser.pyx", line 866, in pandas.parser.TextReader._read_rows (pandas/parser.c:8639)
  File "pandas/parser.pyx", line 973, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9950)
  File "pandas/parser.pyx", line 1033, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10737)
  File "pandas/parser.pyx", line 1130, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:12141)
  File "pandas/parser.pyx", line 1150, in pandas.parser.TextReader._string_convert (pandas/parser.c:12355)
  File "pandas/parser.pyx", line 1382, in pandas.parser._string_box_utf8 (pandas/parser.c:17679)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 4: unexpected end of data
unutbu
Zak

1 Answer

In Python 3, when pd.read_csv is passed a file path (as opposed to a file buffer), it decodes the contents with the utf-8 codec by default.¹ It appears your CSV file uses a different encoding. Since it was generated by MS Excel, it might be cp1252:

In [25]: print(b'\xc9'.decode('cp1252'))
É

In [27]: import unicodedata as UDAT
In [28]: UDAT.name(b'\xc9'.decode('cp1252'))
Out[28]: 'LATIN CAPITAL LETTER E WITH ACUTE'

The error message

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9

says that b'\xc9'.decode('utf-8') raises a UnicodeDecodeError.
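
The "unexpected end of data" part of the message has a simple explanation: in UTF-8, 0xc9 is the lead byte of a two-byte sequence, so the decoder expects a continuation byte after it. A minimal Python 3 sketch (the variable names are illustrative only):

```python
# 0xc9 is 0b11001001: the 110xxxxx bit pattern marks the first byte of a
# two-byte UTF-8 sequence, so a lone 0xc9 is an incomplete character.
lone = b'\xc9'

try:
    lone.decode('utf-8')
    reason = None
except UnicodeDecodeError as err:
    reason = err.reason  # the decoder ran out of bytes mid-character

print(reason)

# The same single byte is a complete character in cp1252.
decoded = lone.decode('cp1252')
print(decoded == '\u00c9')
```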

The above shows byte 0xc9 can be decoded with cp1252. It remains to be seen if the rest of the file can also be decoded with cp1252, and if it produces the desired result.

Unfortunately, given only a file, there is no surefire way to tell what encoding (if any) was used. It depends entirely on the program used to generate the file.
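
One way to check whether the rest of the file decodes is to read the raw bytes and try each candidate encoding in turn. This is only a sketch: the helper name decodable_with, the candidate list, and the sample data are all made up for illustration, and a temporary file stands in for the real CSV.

```python
import os
import tempfile


def decodable_with(path, candidates):
    """Return the candidate encodings that decode the whole file without error."""
    with open(path, 'rb') as f:
        raw = f.read()
    ok = []
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok


# Hypothetical sample data standing in for the Excel export.
sample = 'name,city\n\u00c9mile,Qu\u00e9bec\n'.encode('cp1252')
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

result = decodable_with(sample_path, ['utf-8', 'cp1252', 'latin-1'])
os.remove(sample_path)
print(result)
```

A clean decode only narrows the options. latin-1, for instance, accepts every possible byte sequence, so it always appears in the result even when it is the wrong choice. You still need to look at the decoded text to judge whether it is what you expect.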

If cp1252 is the right encoding, then to load the file into a DataFrame use

data = pd.read_csv('./2010-11.csv', encoding='cp1252') 

¹ When pd.read_csv is passed a buffer, the buffer could have been opened with encoding already set:

# Python3
with open('/tmp/test.csv', 'r', encoding='cp1252') as f:
    df = pd.read_csv(f)
    print(df)

in which case pd.read_csv will not attempt to decode since the buffer f is already supplying decoded strings.
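
The same behaviour can be seen without pandas. In this minimal sketch, a throwaway temporary file stands in for /tmp/test.csv and the standard csv module stands in for pd.read_csv; the file object does the decoding, so the reader only ever sees str values:

```python
import csv
import os
import tempfile

# Hypothetical cp1252-encoded file standing in for the real CSV.
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write('name\n\u00c9mile\n'.encode('cp1252'))
    path = tmp.name

# Text mode with an explicit encoding: the file object decodes as it reads.
with open(path, 'r', encoding='cp1252', newline='') as f:
    rows = list(csv.reader(f))

os.remove(path)
print(type(rows[1][0]).__name__)
```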

unutbu
  • Spot on, thanks. Any idea why the script works in Python 2.7 without the encoding but not in Python 3.4? – Zak Jul 20 '15 at 12:09
  • In Python2.7, `pd.read_csv` seems to leave the data as bytes. In Python3, `pd.read_csv` tries to decode the bytes. – unutbu Jul 20 '15 at 12:11
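
The difference described in the last comment can be illustrated with plain file reads. This is an analogy under stated assumptions, not pandas' actual code path: binary mode plays the role of "leave the data as bytes", text mode with utf-8 plays the role of "try to decode", and the temporary file and its contents are hypothetical.

```python
import os
import tempfile

# Hypothetical cp1252-encoded file containing the byte 0xc9.
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write('id,name\n1,\u00c9\n'.encode('cp1252'))
    path = tmp.name

# Binary mode: bytes pass through untouched, so nothing can fail to decode.
with open(path, 'rb') as f:
    raw = f.read()

# Text mode with utf-8: decoding happens on read and trips over 0xc9.
try:
    with open(path, 'r', encoding='utf-8') as f:
        f.read()
    failed = False
except UnicodeDecodeError:
    failed = True

os.remove(path)
print(b'\xc9' in raw, failed)
```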