I am new to Python, and I am trying to read a CSV file using the script below.

Past=pd.read_csv("C:/Users/Admin/Desktop/Python/Past.csv",encoding='utf-8')

But I am getting the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte". Please help me understand the issue here; I used the encoding argument in the script and thought it would resolve the error.

Liam
user3734568

8 Answers

This happens because you chose the wrong encoding.

Since you are working on a Windows machine, just replacing

Past=pd.read_csv("C:/Users/.../Past.csv",encoding='utf-8') 

with

Past=pd.read_csv("C:/Users/.../Past.csv",encoding='cp1252')

should solve the problem.
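To see why this fix works, it can help to look at the bytes themselves. A minimal sketch (the byte string is a made-up example, not from the question's file): 0x96 is never a valid start byte in UTF-8, but in cp1252 it is the en dash that Windows programs commonly insert.

```python
# 0x96 is invalid as a UTF-8 start byte, but decodes as an en dash in cp1252.
raw = b"Sales 2019\x962020"  # hypothetical bytes from a Windows-generated file

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    # 'utf-8' codec can't decode byte 0x96 in position 10: invalid start byte
    print(e)

print(raw.decode("cp1252"))  # Sales 2019–2020
```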

Liam
  • How did you determine that `cp1252` was the proper encoding? Chances are it wasn't; you got lucky because it stopped throwing errors, but now you may have incorrect characters in your data. – Mark Ransom Jul 28 '21 at 16:36
  • @MarkRansom yes – Liam Jul 28 '21 at 18:45
  • The way to figure out the encoding is with the chardet library. Using it on the file that produced this error gave me "Windows-1252" as the encoding, which is a synonym for "cp1252" (https://docs.python.org/3.8/library/codecs.html#standard-encodings). See https://stackoverflow.com/a/61025300/2800876 for how to do this. – Zags Apr 07 '22 at 15:29
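The chardet approach in the comment above requires a third-party install. As a stdlib-only sketch of the same idea (the function name and candidate list are my own illustration, not from the thread), you can try a list of candidate encodings in order of strictness:

```python
# Try candidate encodings from strictest to most permissive; a permissive
# encoding such as latin1 never raises, so it must come last.
def guess_encoding(data, candidates=("utf-8", "cp1252", "latin1")):
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding(b"caf\xc3\xa9"))  # utf-8
print(guess_encoding(b"caf\xe9"))      # cp1252
```

Note that, as Mark Ransom's comment warns, this only proves a file is *decodable* under an encoding, not that the decoded characters are correct.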
Try using :

pd.read_csv("Your filename", encoding="ISO-8859-1")

The data I parsed from a website was saved in this encoding instead of the default UTF-8 encoding, which is the standard.

mercury
ask_me
  • Welcome to StackOverflow. Answers with only code in them tend to get flagged for deletion as they are "low quality". Please read the help section on answering questions then consider adding some commentary to your Answer. – Graham Mar 07 '18 at 02:18
  • Yes, `ISO-8859-1` eliminates all the errors because every possible byte maps to a valid character. Doesn't mean the characters are correct though. How did you determine the correct encoding used by the website? – Mark Ransom Jul 28 '21 at 16:33
Use this solution if you want to strip out (ignore) the offending characters and get the string back without them. Only use this if your need is to strip them, not convert them.

with open(path, encoding="utf8", errors='ignore') as f:

Using errors='ignore' you'll just lose some characters. But if you don't care about them, because they seem to be extra characters originating from the bad formatting and programming of the clients connecting to my socket server, then it's an easy, direct solution. reference
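The effect of errors='ignore' can be seen directly at the bytes level (a made-up example): the undecodable byte is silently dropped, which is exactly why this option loses data.

```python
raw = b"caf\x96e"  # contains a byte that is invalid in UTF-8

# errors='ignore' silently drops the offending byte instead of raising
print(raw.decode("utf-8", errors="ignore"))   # cafe

# errors='replace' keeps a visible U+FFFD marker in its place instead
print(raw.decode("utf-8", errors="replace"))  # caf�e
```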

Nitish Kumar Pal
Passing the following encoding to read_csv works very well for me:

Past = pd.read_csv("C:/Users/Admin/Desktop/Python/Past.csv", encoding='latin1')
Jia Gao
  • Yes, `latin1` eliminates all the errors because every possible byte maps to a valid character. Doesn't mean the characters are correct though. – Mark Ransom Jul 28 '21 at 16:30
  • Hi, can you be more specific? or can you please refer to some resources? Interested. – Jia Gao Jul 30 '21 at 01:42
  • You can see all the possible encodings supported by Python in [Standard Encodings](https://docs.python.org/3/library/codecs.html#standard-encodings); there are quite a few of them, and they will generate different characters when presented with the same bytes. But `latin1` is unique in being the only one without invalid bytes, the only one that can do `bytes(range(256)).decode('latin1')` without generating an error. – Mark Ransom Jul 30 '21 at 04:42
  • Hi Ransom, thanks for the reply, that's helpful. Always terrified by the encoding issue. – Jia Gao Jul 30 '21 at 07:38
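The claim in the comments above is easy to verify for yourself: latin1 assigns a character to every one of the 256 byte values, so decoding can never fail, and the round trip is lossless. A quick sanity check (my own, not from the thread):

```python
# Every byte value 0-255 decodes under latin1, and the round trip is lossless.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin1")          # never raises
print(len(text))                           # 256
print(text.encode("latin1") == all_bytes)  # True
```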
It's an old question, but it shows up while searching for solutions to this error, so I thought I'd answer for anyone who still stumbles on this thread. The encoding of the file can be checked before passing a value for the encoding argument. A simple option on Windows is to open the file in Notepad++ and look at the encoding it reports. The correct value for the encoding argument can then be found in the Python documentation. Look at this question and its answers on Stack Overflow for more details on the different ways to determine a file's encoding.

Kumar Saurabh
Using the code below works for me:

with open(keeniz_dir + '/world_cities.csv',  'r', encoding='latin1') as input:
Juba Fourali
  • Yes, `latin1` eliminates all the errors because every possible byte maps to a valid character. Doesn't mean the characters are correct though. – Mark Ransom Jul 28 '21 at 16:30
df = pd.read_csv( "/content/data.csv",encoding='latin1')

Just add encoding='latin1' and it will work.

Developer-Felix
Don't pass the encoding option unless you are sure about the file's encoding. The default value encoding=None passes errors="replace" to the open() function that is called; characters with encoding errors will be substituted with replacements, and you can then figure out the correct encoding or just use the resulting DataFrame. If a wrong encoding is provided, pd will pass errors="strict" to open() and you get a ValueError if the encoding is incorrect.

Jacek Błocki
  • It's a good suggestion, but since pandas version 1.3.0 this default behavior doesn't hold, and a new parameter 'encoding_errors' has been added. Setting that to 'replace' will now do what you described. This helps me get past this issue where I need to automatically process many files with different encodings (while making sure the substitutions don't affect my data of interest). – GreenEye Dec 20 '21 at 14:48
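As a sketch of the encoding_errors behavior mentioned in the comment (requires pandas >= 1.3; the in-memory CSV bytes are a made-up example), an undecodable byte becomes the U+FFFD replacement character instead of raising:

```python
import io

import pandas as pd

data = b"name,value\ncaf\x96,1\n"  # 0x96 is invalid UTF-8
df = pd.read_csv(io.BytesIO(data), encoding="utf-8", encoding_errors="replace")
print(df.loc[0, "name"])  # caf� (the 0x96 byte became U+FFFD)
```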