82

I'm attempting to read a CSV file into a DataFrame in pandas. When I try to do that, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 55: invalid start byte

This is the code that produces it:

import pandas as pd

location = r"C:\Users\khtad\Documents\test.csv"

df = pd.read_csv(location, header=0, quotechar='"')

This is on a Windows 7 Enterprise Service Pack 1 machine and it seems to apply to every CSV file I create. In this particular case the binary from location 55 is 00101001 and location 54 is 01110011, if that matters.

Saving the file as UTF-8 with a text editor doesn't seem to help. Adding the parameter `encoding='utf-8'` doesn't work either; it returns the same error.

What is the most likely cause of this error and are there any workarounds other than abandoning the DataFrame construct for the moment and using the csv module to read in the CSV line-by-line?

khtad
  • 3
    have you tried passing param `encoding='utf-8'` to `read_csv`? – EdChum May 26 '15 at 15:29
  • 2
    or have you tried reading the file using csv module to check if there is an issue with the file itself? – Alexander May 26 '15 at 16:15
  • @Alexander I did successfully read the file with csv module, yes. – khtad May 26 '15 at 17:06
  • 1
    @EdChum I'll add that to the question, but yes, that's one of the things I tried. – khtad May 26 '15 at 17:06
  • 3
    You'll have to post raw input or a link to the data. You could also try `utf-16` for the `encoding`, just in case. – EdChum May 26 '15 at 17:12
  • A workaround: df = pd.DataFrame.from_csv(location, header=0, sep=',', encoding='utf-8') solves the problem and stuffs the CSV into the dataframe. – khtad May 26 '15 at 17:20
  • Bizarrely, this works in a new file, but not when I copy-paste the code into the old file in PyCharm. – khtad May 26 '15 at 17:31
  • Try this SO link: http://stackoverflow.com/questions/7873556/utf8-codec-cant-decode-byte-0x96-in-python – Alexander May 26 '15 at 19:14
  • 2
    Please don't use `pd.DataFrame.from_csv`; it is no longer maintained. Use the top-level `pd.read_csv` instead, as it is more feature-rich. – EdChum May 27 '15 at 10:16

2 Answers

215

Try calling read_csv with encoding='latin1', encoding='iso-8859-1' or encoding='cp1252' (these are some of the various encodings found on Windows).
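
To illustrate why these encodings get past the error, here is a minimal sketch using only the standard library. The sample bytes are made up for illustration and are not taken from the asker's file. Byte 0x96 cannot start a UTF-8 sequence, whereas the single-byte encodings above map it to some character:

```python
# Hypothetical sample line; 0x96 stands in for the byte named in the error message.
raw = b"1999\x962005,test\n"

# UTF-8 rejects 0x96 as the first byte of a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
    utf8_error = ""
except UnicodeDecodeError as exc:
    utf8_ok = False
    utf8_error = str(exc)

# The single-byte encodings accept it, but not with the same result.
as_cp1252 = raw.decode("cp1252")    # 0x96 becomes U+2013 (en dash)
as_latin1 = raw.decode("latin1")    # 0x96 becomes U+0096 (a C1 control character)
as_iso = raw.decode("iso-8859-1")   # in Python, the same codec as latin1

print(utf8_ok, utf8_error)
print(repr(as_cp1252))
print(repr(as_latin1))
```

All three single-byte decodes run without error, but only cp1252 turns 0x96 into a printable dash here. If the file came from a Windows program, `encoding='cp1252'` is probably the one worth checking first.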

maxymoo
  • I was able to use all 3 of these encodings successfully. – Smitty Feb 15 '18 at 22:12
  • 2
    Carefully choose the encoding. There are a few differences such as typographic quotes. Another common one is iso-8859-15, which includes the EUR sign. – Joachim Wagner May 14 '18 at 07:55
  • 1
    This was the first thread I stumbled upon about this problem, so just for the sake of completeness: None of the above worked for my (similar) problem, but `UTF-16` as encoding did work. Try this if the ones mentioned by maxymoo fail. – Thomas Jan 29 '19 at 12:44
  • removing the encoding attribute worked for me – Somangshu Goswami Mar 13 '19 at 18:16
  • 1
    `encoding='iso-8859-1'` worked for me on windows. – Jitendra Apr 18 '19 at 10:35
  • Characters like `£` won't be decoded the same between the various encodings, so like @JoachimWagner wrote, select the encoding carefully. One way to see the result is to open the csv file with open office calc and try encodings on the [import configuration panel](https://i.stack.imgur.com/aJ6du.png) which is displayed. – mins Nov 05 '20 at 14:44
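
The point in the comments about choosing carefully can be checked without a spreadsheet program. The sketch below is an illustration only: `decode_with_each` is a hypothetical helper (not part of pandas) and the sample bytes are invented. It decodes the same bytes with several candidate encodings so the results can be compared side by side:

```python
def decode_with_each(raw, encodings=("utf-8", "cp1252", "latin1", "iso-8859-15")):
    """Decode raw bytes with each candidate encoding.

    Returns a dict mapping encoding name to the decoded text,
    or to None when that encoding cannot decode the bytes.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Invented sample: 0x93/0x94 are typographic quotes in cp1252,
# and 0xA4 is the euro sign in iso-8859-15.
sample = b"\x93price\x94 \xa4 5"
results = decode_with_each(sample)

for name, text in results.items():
    print(name, repr(text))
```

With real data, one would read the first few kilobytes of the file in binary mode (`open(path, 'rb')`), pass them to the helper, and hand the encoding whose output looks right to `pd.read_csv(..., encoding=...)`. An encoding that decodes without error is not necessarily the correct one, so the output still needs a visual check.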
19

This also works on a Mac. You can use:

df = pd.read_csv('Region_count.csv', encoding='latin1')
vvvvv
sushmit