UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

Question

I am using Python 2.6 CGI scripts but found this error in the server log while calling json.dumps():

Traceback (most recent call last):
  File "/etc/mongodb/server/cgi-bin/getstats.py", line 135, in <module>
    print json.dumps(__getdata())
  File "/usr/lib/python2.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

Here, the __getdata() function returns a dictionary {}.

Before posting this question I referred to this question on SO.

UPDATES

The following line is hurting the JSON encoder:

now = datetime.datetime.now()
now = datetime.datetime.strftime(now, '%Y-%m-%dT%H:%M:%S.%fZ')
print json.dumps({'current_time': now}) # this is the culprit

I got a temporary fix for it

print json.dumps( {'old_time': now.encode('ISO-8859-1').strip() })

But I am not sure is it correct way to do it.

It looks like you have some string data in the dictionary that can't be encoded/decoded. What's in the `dict`? — mgilson, Mar 06 '14 at 05:53
@mgilson yup master I understood the issue but donno how to deal with it..`dict` has `list, dict, python timestamp value ` — Deepak Ingole, Mar 06 '14 at 05:53
To debug, post the lines that throw the error. It will be more useful. — Priyank Patel, Mar 06 '14 at 05:55
@Pilot -- Not really. The real problem is buried somewhere in `__getdata`. I don't know *why* you're getting a non-decodable character. You can try to come up with patches on the dict to make it work, but those are mostly just asking for more problems later. I would try printing the dict to see where the non-ascii character is. Then figure out how that field got calculated/set and work backward from there. — mgilson, Mar 06 '14 at 07:04
Possible duplicate of [UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c](http://stackoverflow.com/q/12468179/1677912). — Mogsdad, Mar 05 '16 at 18:27
I had that same error when trying to read a .csv file which had some non-ascii characters in it. Removing those characters (as suggested below) solved the issue. — Dmitriy R. Starson, Feb 14 '17 at 03:27
**But I am not sure is it correct way to do it.** It is indeed... — Romeo Sierra, Oct 18 '20 at 08:19

score 378 · Answer 1 · edited Aug 30 '23 at 13:12

If you get this error when trying to read a csv file, the read_csv() function from pandas lets you set the encoding:

import pandas as pd
data = pd.read_csv(filename, encoding='unicode_escape')
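
A caution on the `unicode_escape` codec: it is not a general-purpose text encoding. As documented for the standard library, it decodes bytes as Latin-1 and then interprets backslash escapes, so it never fails on a byte such as 0xa5 but can silently alter data that contains backslashes. A minimal sketch with plain `bytes.decode` (the sample byte strings are made up for illustration and are not from the question):

```python
# Hypothetical sample data; not taken from the question.
raw_price = b"\xa5100"            # 0xa5 is the yen sign in Latin-1
raw_path = b"C:\\temp\\new.csv"   # contains literal backslashes

# Bytes >= 0x80 are mapped as Latin-1, so no UnicodeDecodeError is raised.
price = raw_price.decode("unicode_escape")

# Backslash sequences are interpreted: "\t" becomes a tab, "\n" a newline.
path = raw_path.decode("unicode_escape")

# Decoding with an actual character encoding keeps the backslashes intact.
path_latin1 = raw_path.decode("latin-1")

print(repr(price))
print(repr(path))
print(repr(path_latin1))
```

If the file's real encoding is known (for example windows-1252), passing that name as `encoding` is likely the safer choice.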

edited Aug 30 '23 at 13:12 by endive1783 · answered May 26 '18 at 01:15 by MSalty

Only if you're using `pandas` – Valeriy Dec 30 '19 at 15:46
Sorry, this didn't work; I got the same error again. But when I used ('filename.csv', engine='python'), it worked. – basavaraj_S Jan 28 '20 at 10:29
Works for pysrt too: pysrt.open(subfilename, encoding='unicode_escape'). I think this solution should work with un-encoded/plain text for any library that supports an encoding argument on file open. "unicode_escape" will open the file, but if you have non-ASCII text you should give the specific encoding, for example encoding='ISO-8859-9' for Turkish. – Gorkem Mar 08 '21 at 11:09
I had the same problem using pandas and this solution worked. However, is there a way to change the file itself? My csv file was generated by Excel and I don't see any special characters. So I am thinking there must be a way to re-save this file so that I don't have to use 'unicode-escape'. – spark Jan 10 '23 at 15:39
Also worked inside R with reticulate. – Geo Vogler Jan 12 '23 at 19:04

score 174 · Answer 2 · edited Aug 30 '23 at 13:03

By default, the open function uses mode 'r' (read as text). You can set it to 'rb' (read binary), which returns the raw bytes without trying to decode them.

Try the below code snippet:

with open(path, 'rb') as f:
  text = f.read()
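
To illustrate the difference, here is a small self-contained sketch. The temporary file and its contents are assumptions made up for the demonstration. Text mode decodes while reading and fails on the bad byte; binary mode hands back the bytes untouched, which means you still have to decode them yourself later.

```python
import os
import tempfile

# Hypothetical sample file containing one byte (0xa5) that is not valid UTF-8.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xa5100\n")

# Text mode decodes while reading, so the bad byte raises UnicodeDecodeError.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

# Binary mode returns raw bytes and performs no decoding at all.
with open(path, "rb") as f:
    data = f.read()

os.remove(path)
print(text_mode_failed, type(data).__name__)
```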

edited Aug 30 '23 at 13:03 by endive1783 · answered Sep 07 '17 at 09:39 by Soumyaansh

I had `r` instead of `rb`. thanks for the reminder to add `b`! – Paul Jan 13 '18 at 22:08
By default `open` function has 'r' as read only mode. `rb` stands for read binary mode. – shiva Feb 04 '20 at 18:52

score 118 · Accepted Answer · edited Mar 03 '17 at 05:31

The error occurs because the dictionary contains a string with a non-ASCII character that can't be encoded/decoded. One simple way to avoid this error is to encode such strings with the encode() function, as follows (where a is the string with the non-ASCII character):

a.encode('utf-8').strip()
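
The line above targets Python 2 strings. In Python 3 the comparable step is to decode any byte strings into `str` before serialising. The following is only a sketch under assumptions: `decode_bytes` is a hypothetical helper, the sample dictionary is invented, and Latin-1 is assumed as the source encoding purely for illustration.

```python
import json

def decode_bytes(value, encoding):
    """Recursively turn bytes into str inside dicts, lists and tuples."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, dict):
        return {decode_bytes(k, encoding): decode_bytes(v, encoding)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [decode_bytes(item, encoding) for item in value]
    return value

# Hypothetical data standing in for the dictionary returned by __getdata().
data = {"price": b"\xa5100", "tags": [b"caf\xe9", "plain"]}

clean = decode_bytes(data, "latin-1")
encoded = json.dumps(clean)                       # non-ASCII escaped as \uXXXX
readable = json.dumps(clean, ensure_ascii=False)  # non-ASCII kept as-is
print(encoded)
print(readable)
```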

edited Mar 03 '17 at 05:31 by Jean-Francois T. · answered Mar 06 '14 at 06:28 by Santosh Ghimire

Since UTF-8 is back-compatible with the oldschool 7-bit ASCII you should just encode everything. For characters in the 7-bit ASCII range this encoding will be an identity mapping. – Tadeusz A. Kadłubowski Mar 06 '14 at 07:47
This doesn't seem real clear. When importing a csv file how do you use this code? – Dave Sep 17 '19 at 15:13
The same issue appears for me when executing an sqlalchemy query, how would I encode the query (has no .encode, since its not a string)? – c8999c 3f964f64 Jul 03 '20 at 09:27

score 51 · Answer 4 · edited Feb 04 '20 at 18:48

Your string has a non-ASCII character encoded in it.

Not being able to decode with utf-8 may happen if other encodings were used somewhere in your code or data. For example:

>>> 'my weird character \x96'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 19: invalid start byte

In this case, the encoding is windows-1252 so you have to do:

>>> 'my weird character \x96'.decode('windows-1252')
u'my weird character \u2013'

Now that you have Unicode, you can safely encode into utf-8.
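
For readers on Python 3, the same session can be sketched with a `bytes` literal. This is an illustrative translation of the example above, not part of the original answer.

```python
# Python 3 version of the session above: bytes in, str out.
raw = b"my weird character \x96"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    reason = exc.reason      # why the decoder gave up
    position = exc.start     # index of the offending byte

text = raw.decode("windows-1252")   # 0x96 is an en dash in windows-1252
utf8_bytes = text.encode("utf-8")   # now safe to store or send as UTF-8
print(repr(text), utf8_bytes)
```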

edited Feb 04 '20 at 18:48 by shiva · answered Mar 23 '15 at 18:19 by JCF

I have created a simple page which may help establish the encoding of some unexpected "mystery bytes"; https://tripleee.github.io/8bit/ – tripleee Feb 04 '20 at 18:50
This solved it for my case with pd.read(), just switching to encoding='windows-1252' instead of encoding='utf-8' – gseattle Jan 05 '23 at 10:49

score 39 · Answer 5 · edited Feb 04 '20 at 18:40

When reading the CSV, I added an encoding argument:

import pandas as pd
dataset = pd.read_csv('sample_data.csv', header= 0,
                        encoding= 'unicode_escape')

edited Feb 04 '20 at 18:40 by shiva · answered Mar 28 '19 at 06:15 by Krishna prasad.m

score 37 · Answer 6 · answered Feb 04 '20 at 18:53

This solution worked for me:

import pandas as pd
data = pd.read_csv("training.csv", encoding = 'unicode_escape')

answered Feb 04 '20 at 18:53 by shiva

score 26 · Answer 7 · edited Feb 04 '20 at 18:43

Inspired by @aaronpenne and @Soumyaansh

# Open in binary mode, then decode; undecodable bytes become U+FFFD
with open("file.txt", "rb") as f:
    text = f.read().decode(errors='replace')

edited Feb 04 '20 at 18:43 by shiva · answered Jul 15 '18 at 19:13 by Punnerud

I got "AttributeError: 'str' object has no attribute 'decode'". Not sure what went wrong? – Victor Wong Dec 06 '18 at 03:14
Did you include the b in "rb"? The b opens the file in bytes mode. If you use just r, you get a string, which has no decode method. – Punnerud Dec 06 '18 at 11:11

score 17 · Answer 8 · answered Feb 09 '15 at 12:23

Set the default encoding at the top of your code (this works in Python 2 only):

import sys
reload(sys)
sys.setdefaultencoding("ISO-8859-1")

answered Feb 09 '15 at 12:23 by HimalayanCoder

I think python3 doesn't have setdefaultencoding in sys module! – Anwar Hossain Jun 20 '20 at 12:50

score 17 · Answer 9 · answered Dec 19 '19 at 08:17

Simple Solution:

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

answered Dec 19 '19 at 08:17 by Gil Baggio

The only solution that works for me of all those presented here. – lunesco Jun 24 '20 at 14:16
This solution helped me open csv file with `korean` language in it. Thanks – Inyoung Kim 김인영 Dec 07 '20 at 02:31

score 16 · Answer 10 · edited Feb 04 '20 at 18:42

As of 2018-05 this is handled directly with decode, at least for Python 3.

I'm using the below snippet for invalid start byte and invalid continuation byte type errors. Adding errors='ignore' fixed it for me.

with open(out_file, 'rb') as f:
    for line in f:
        print(line.decode(errors='ignore'))
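
To make the trade-off visible, here is a short sketch comparing the standard error handlers on a made-up byte string. `ignore` drops the offending byte, `replace` substitutes U+FFFD, and `backslashreplace` keeps a visible escape so the lost byte can still be identified afterwards.

```python
raw = b"\xa5100"   # hypothetical input; 0xa5 cannot start a UTF-8 sequence

ignored = raw.decode("utf-8", errors="ignore")            # bad byte dropped
replaced = raw.decode("utf-8", errors="replace")          # bad byte -> U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")  # bad byte -> \xa5

print(repr(ignored), repr(replaced), repr(escaped))
```

None of these recovers the intended character; they only keep the program running.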

edited Feb 04 '20 at 18:42 by shiva · answered May 15 '18 at 22:08 by aaronpenne

Of course, this silently discards information. A much better fix is to figure out what's supposed to be there, and fixing the original problem. – tripleee Feb 04 '20 at 18:47

score 14 · Answer 11 · edited Nov 02 '20 at 04:07

If the above methods are not working for you, you may want to look into changing the encoding of the csv file itself.

Using Excel:

1. Open the csv file in Excel.
2. Go to the File menu and click Save As.
3. Click Browse to select a location to save the file.
4. Enter the intended filename.
5. Select the CSV (Comma delimited) (*.csv) option.
6. Click the Tools drop-down box and click Web Options.
7. Under the Encoding tab, select Unicode (UTF-8) from the "Save this document as" drop-down list.
8. Save the file.

Using Notepad:

1. Open the csv file in Notepad.
2. Go to File > Save As.
3. Select the location for the file.
4. Set the "Save as type" option to All Files (*.*).
5. Specify the file name with the .csv extension.
6. From the Encoding drop-down list, select UTF-8.
7. Click Save to save the file.

By doing this, you should be able to import csv files without encountering the UnicodeDecodeError.
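
The same re-save can also be scripted. The sketch below assumes you already know the file's current encoding; windows-1252 is used here only as an example. `transcode_to_utf8` and the file names are hypothetical.

```python
import os
import tempfile

def transcode_to_utf8(src_path, dst_path, src_encoding):
    """Rewrite a text file as UTF-8, given its known source encoding."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Hypothetical demo file saved as windows-1252.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "legacy.csv")
dst = os.path.join(workdir, "utf8.csv")
with open(src, "wb") as f:
    f.write(b"item,price\r\ntea,\xa5100\r\n")

transcode_to_utf8(src, dst, "windows-1252")

with open(dst, "rb") as f:
    converted = f.read()
print(converted)
```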

This worked for me while none of the above solutions did. – Raj Salla Mar 03 '22 at 22:17

score 10 · Answer 12 · answered Feb 19 '21 at 02:37

The following snippet worked for me.

import pandas as pd
df = pd.read_csv(filename, sep=';', encoding='latin1', error_bad_lines=False)  # error_bad_lines=False skips malformed lines instead of raising

answered Feb 19 '21 at 02:37 by amit haldar

error_bad_lines has been deprecated; use on_bad_lines instead, e.g. on_bad_lines='skip' (see https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). I should add that this did not help me. – n.r. Apr 05 '23 at 14:27

score 7 · Answer 13 · answered Mar 19 '14 at 10:23

The following line is hurting the JSON encoder:

now = datetime.datetime.now()
now = datetime.datetime.strftime(now, '%Y-%m-%dT%H:%M:%S.%fZ')
print json.dumps({'current_time': now}) # this is the culprit

I got a temporary fix for it:

print json.dumps( {'old_time': now.encode('ISO-8859-1').strip() })

Marking this as correct as a temporary fix (though I am not sure about it).

score 4 · Answer 14 · edited Feb 04 '20 at 18:46

You may use any standard encoding that suits your specific usage and input.

utf-8 is the default.

iso8859-1 is also popular for Western Europe.

e.g: bytes_obj.decode('iso8859-1')

see: docs
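
One caveat worth seeing in code: single-byte encodings such as iso8859-1 accept every byte value, so a wrong guess raises no error and just yields different characters. A minimal sketch with a made-up byte string, assuming the data was really written as windows-1252:

```python
raw = b"5 \x96 10"   # hypothetical bytes written by a windows-1252 program

as_cp1252 = raw.decode("windows-1252")   # intended text: en dash (U+2013)
as_latin1 = raw.decode("iso8859-1")      # also succeeds, but yields U+0096

print(repr(as_cp1252), repr(as_latin1))
```

Both calls succeed, yet only one matches what the author of the data meant, so it helps to confirm the encoding from the data's source rather than from the absence of an exception.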

edited Feb 04 '20 at 18:46 by shiva · answered Nov 11 '19 at 11:22 by NoamG

Blindly guessing the encoding is likely to produce more errors. Selecting iso8859-1 or cp1251 etc without actually knowing which encoding the file uses will remove the symptom, but produce garbage if you guessed wrong. If it's just a few bytes, it could take years before you notice and fix the *real* error. – tripleee Feb 04 '20 at 18:53

score 2 · Answer 15 · edited Nov 02 '20 at 04:08

If it still throws the same error after trying all the aforementioned workarounds, try exporting the file as CSV (a second time if you already have). Especially if you're using scikit-learn, it is best to import the dataset as a CSV file.

I spent hours on this, whereas the solution was this simple. Export the file as a CSV to the directory where Anaconda or your classifier tools are installed and try again.

score 1 · Answer 16 · answered Feb 11 '20 at 05:22

Instead of looking for ways to decode a5 (Yen ¥) or 96 (en-dash –), tell MySQL that your client is encoded "latin1", but you want "utf8" in the database.

See details in Trouble with UTF-8 characters; what I see is not what I stored

answered Feb 11 '20 at 05:22 by Rick James

score 1 · Answer 17 · answered Jan 31 '22 at 17:54

I encountered the same error while trying to import to a pandas dataframe from an excel sheet on sharepoint. My solution was using engine='openpyxl'. I'm also using requests_negotiate_sspi to avoid storing passwords in plain text.

import requests
import pandas as pd
from io import BytesIO
from requests_negotiate_sspi import HttpNegotiateAuth

cert = r'c:\path_to\saved_certificate.cer'
target_file_url = r'https://share.companydomain.com/sites/Sitename/folder/excel_file.xlsx'
# Negotiate (SSPI) authentication, so no password is stored in the script
response = requests.get(target_file_url, auth=HttpNegotiateAuth(), verify=cert)
# Parse the downloaded bytes as an Excel workbook
df = pd.read_excel(BytesIO(response.content), engine='openpyxl', sheet_name='Sheet1')

score 1 · Answer 18 · edited Apr 24 '22 at 00:01

Simple solution:

import pandas as pd

df = pd.read_csv('file_name.csv', engine='python-fwf')

If it's not working try to change the engine to 'python' or 'c'.

edited Apr 24 '22 at 00:01 by Timus · answered Apr 20 '22 at 10:53 by Ashok Kumar Rai

score 0 · Answer 19 · edited Nov 02 '20 at 04:09

In my case, I had to save the file as UTF-8 with BOM, not just as plain UTF-8; then the error was gone.
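
For reference, Python exposes "UTF-8 with BOM" as the `utf-8-sig` codec. The sketch below uses a made-up string and shows what that codec adds on encoding and strips on decoding. Whether a given tool requires the BOM is an assumption that depends on the tool.

```python
import codecs

text = "name,price"                       # hypothetical file contents
with_bom = text.encode("utf-8-sig")       # prepends the 3-byte BOM
assert with_bom.startswith(codecs.BOM_UTF8)

from_bom = with_bom.decode("utf-8-sig")                # BOM stripped
from_plain = text.encode("utf-8").decode("utf-8-sig")  # no BOM: unchanged
kept_bom = with_bom.decode("utf-8")                    # BOM kept as U+FEFF
print(repr(from_bom), repr(kept_bom))
```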

edited Nov 02 '20 at 04:09 by shiva · answered Oct 18 '20 at 11:32 by luky

score 0 · Answer 20 · answered Jan 02 '21 at 00:13

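import pandas as pd
from io import BytesIO

# bytes_content is a placeholder for a bytes-like object holding the .xlsx data
df = pd.read_excel(BytesIO(bytes_content), engine='openpyxl')

This worked for me.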

answered Jan 02 '21 at 00:13 by Madat Sardarli

Where does `bytes_content` come from? – Gino Mempin Jan 02 '21 at 01:28
bytes_content is just a sample variable, containing bytes like object – Madat Sardarli Jan 02 '21 at 18:48

score 0 · Answer 21 · answered Jan 19 '23 at 07:29

I know this doesn't fit directly to the question, but I repeatedly get directed to this when I google the error message.

I did get the error when I mistakenly tried to install a Python package like I would install requirements from a file, i.e., with -r:

# wrong: leads to the error above
pip install -r my_package.whl

# correct: without -r
pip install my_package.whl

I hope this helps others who made the same little mistake as I did without noticing.
