127

I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.)

I open the CSV using:

ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')

Then, I attempt to encode it with:

name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13], county=row[25].encode('utf-8'), lat=row[22], lng=row[23])

I'm encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into what I can use, I get the following Traceback.

Traceback (most recent call last):
  File "push_into_db.py", line 80, in <module>
    main()
  File "push_into_db.py", line 74, in main
    district_map = buildDistrictSchoolMap()
  File "push_into_db.py", line 32, in buildDistrictSchoolMap
    county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

I think I should tell you that I'm using Python 2.7.2, and this is part of an app built on Django 1.4. I've read several posts on this topic, but none of them seem to directly apply. Any help will be greatly appreciated.

You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.

jelkimantis
    What is your original file encoding? I think you should decode it according to the original encoding and then convert to utf 8 – xiao 啸 May 02 '12 at 00:21
  • possible duplicate of [Encoding gives "'ascii' codec can't encode character … ordinal not in range(128)"](http://stackoverflow.com/questions/2513027/encoding-gives-ascii-codec-cant-encode-character-ordinal-not-in-range128) [Ed.: and of approximately a zillion others, too, I'm sure.] – Karl Knechtel May 02 '12 at 01:08

12 Answers

164

Unicode is not equal to UTF-8. The latter is just an encoding for the former.

You are doing it the wrong way around. You are reading UTF-8-encoded data, so you have to decode the UTF-8-encoded string into a unicode string.

So just replace .encode with .decode, and it should work (if your .csv is UTF-8-encoded).

Nothing to be ashamed of, though. I bet 3 in 5 programmers had trouble at first understanding this, if not more ;)

Update: If your input data is not UTF-8 encoded, then you have to .decode() with the appropriate encoding, of course. If nothing is given, Python assumes ASCII, which obviously fails on non-ASCII characters.
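
For instance, a minimal sketch of the asker's loop with .decode() in place of .encode() (the filename, delimiter and column indices come from the question; swap 'utf-8' for the file's real codec, e.g. 'latin-1', if that is what it turns out to be):

# Python 2 sketch: decode the bytes that csv.reader yields instead of
# calling .encode() on them
import csv

ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')
for row in ncesReader:
    street = row[9].decode('utf-8')    # str (bytes) -> unicode
    city = row[10].decode('utf-8')
    state = row[11].decode('utf-8')
    county = row[25].decode('utf-8')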

Ingve
ch3ka
    The reason for the error being that Python is trying to automatically decode it from the default encoding, ASCII, so that it can then encode it as he specified, to UTF-8. Since the data isn't valid ASCII, it doesn't work. – agf May 02 '12 at 00:26
    sure, but if it's UTF8-*encoded* data (as I guess), then `.decode('utf-8')` should do the trick, nor? – ch3ka May 02 '12 at 00:29
  • Sure, you're probably right. I was just explaining why you get that specific error in this situation. – agf May 02 '12 at 01:06
    Perfect! Thank you very much. So it turns out that it was .decode('latin-1') -- this makes sense because it was Ñ that was giving me the problem. Again! Thank you! – jelkimantis May 02 '12 at 01:58
    Your solution works for some cases, but in case if I use this then I get another error **'ascii' codec can't encode character u'\xf1' in position 2: ordinal not in range(128)** – Vikash Mishra Nov 21 '16 at 13:45
  • This is not the case always. The 2nd answer worked for me – Yasin May 24 '17 at 14:30
100

Just add these lines to your code:

1. Python 2

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

2. Python 3

Note: sys.setdefaultencoding() was removed in Python 3, so the snippet above only works on Python 2. On Python 3, pass the encoding explicitly when opening the file instead, e.g. open(path, 'r', encoding='utf-8') (see the next answer).
starball
khelili miliana
45

For Python 3 users, you can do:

with open(csv_name_here, 'r', encoding="utf-8") as f:
    # some code here

It works with Flask too :)
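
For instance, a rough Python 3 rewrite of the reader from the question (filename, delimiter and column index are taken from the question; the UTF-8 assumption is mine):

import csv

# Text mode plus encoding= lets open() do the decoding for you
with open('geocoded_output.csv', 'r', encoding='utf-8', newline='') as f:
    ncesReader = csv.reader(f, delimiter='\t', quotechar='"')
    for row in ncesReader:
        city = row[10]    # already a str; no .encode()/.decode() dance needed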

Skrmnghrd
10

The main reason for the error is that Python 2's default encoding is ASCII. Hence, if the string data to be encoded with encode('utf8') contains a character outside the ASCII range, e.g. a string like 'hgvcj터파크387', Python throws an error because the byte string is first implicitly decoded with that default ASCII codec before it can be re-encoded.

If you are using Python 2, a fix is to set the default encoding assumed by Python to utf8:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
name = school_name.encode('utf8')

This way Python is able to handle characters within a string that fall outside the ASCII range.

However, if you are using Python 3, the reload() builtin and sys.setdefaultencoding() are no longer available, so you would have to fix it using decode, e.g.

name = school_name.decode('utf8').encode('utf8')
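
Note that in Python 3 only bytes objects have .decode() (and only str has .encode()), so the line above assumes school_name arrives as bytes. A tiny round trip with made-up data:

raw = b'Pe\xc3\xb1a'        # UTF-8 bytes, e.g. read from a file opened in 'rb' mode
name = raw.decode('utf8')   # bytes -> str: 'Peña'
data = name.encode('utf8')  # str -> bytes again, for an API that wants raw bytes
print(name, data)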
Temi Fakunle
  • what is the difference between your answer and mine – khelili miliana Jul 13 '17 at 08:35
    More detailed. People often find causal details helpful. And your code works btw, no derogation intended. – Temi Fakunle Jul 13 '17 at 09:21
    reload is available in Python 3 you would just have to import it. from imp import reload – Meow Sep 29 '17 at 19:22
  • @Meow but there is no sys.setdefaultencoding in Python 3. So in context of compatibility py2\py3 some check will do, sys.getdefaultencoding() maybe. Would appreciate a piece of advice about that matter. https://stackoverflow.com/questions/28127513/attributeerror-module-object-has-no-attribute-setdefaultencoding – Konst54 Jul 06 '20 at 15:45
5

Check which locale you're using with the locale command. If it's not en_US.UTF-8, change it like this:

sudo apt install locales 
sudo locale-gen en_US en_US.UTF-8    
sudo dpkg-reconfigure locales

If you don't have permission to do that you can run all your Python code like this:

PYTHONIOENCODING="UTF-8" python3 ./path/to/your/script.py

or run this command before running your Python code

export PYTHONIOENCODING="UTF-8"

to set it in the shell you run that in.


In my case, I was using POSIX, the default Ubuntu locale instead of en_US.UTF-8, so I saw this output:

$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

which caused Python to open files as ASCII instead of UTF-8.

You can check which locale Python is using like this:

>>> import locale
>>> locale.getpreferredencoding(False)
'ANSI_X3.4-1968'

locale.getpreferredencoding(False) is the function called by open() when you don't provide an encoding. The output should be 'UTF-8', but in my case it was 'ANSI_X3.4-1968', some variant of ASCII.
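
If you cannot change the locale at all, a minimal sketch of sidestepping it by passing the encoding explicitly (the filename is the one from the question, and UTF-8 is an assumption):

# When encoding= is given, open() no longer consults locale.getpreferredencoding()
with open('geocoded_output.csv', encoding='utf-8') as f:
    header = f.readline()
print(header)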

Boris Verkhovskiy
4

For Python 3 users:

Changing the encoding from 'ascii' to 'latin1' works.

Also, you can try detecting the encoding automatically by reading the first 10000 bytes with the snippet below:

import chardet

with open("dataset_path", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
print(result)
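
The detected encoding can then be fed straight back into open(); a rough follow-on sketch ("dataset_path" is the placeholder from the snippet above):

import chardet

with open("dataset_path", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# result['encoding'] may be None if detection fails, so check it (and
# result['confidence']) before trusting the guess
with open("dataset_path", 'r', encoding=result['encoding']) as f:
    text = f.read()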
Stephen Rauch
Prithvi
1

If you get this issue while running certbot to create or renew a certificate, use the following command:

grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

That command found the offending character "´" in a comment in one .conf file. After removing it (you can edit comments as you wish) and reloading nginx, everything worked again.

Source: https://github.com/certbot/certbot/issues/5236

Anish Varghese
1

Or, when you deal with text in Python and it is Unicode text, mark it as Unicode.

Set text=u'unicode text' instead of just text='unicode text'.

This worked in my case.
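
A small Python 2 sketch of the difference (the words are made up; the coding declaration is needed for the non-ASCII literals):

# -*- coding: utf-8 -*-
# Python 2: the u'' prefix decides whether .encode() will work directly
text = u'Ñandú'    # unicode object: text.encode('utf-8') is fine
raw = 'Ñandú'      # byte string: raw.encode('utf-8') first does an implicit
                   # ASCII decode and raises the UnicodeDecodeError from the question
print(text.encode('utf-8'))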

prosti
1

I was dealing with this issue inside a Docker container. It might be the case (as it was for me) that you only need to generate the locale and do nothing more:

sudo locale-gen en_US en_US.UTF-8

In my case that was sufficient because locales was already installed and configured. If you have to install and configure locales, add the following to your Dockerfile:

RUN apt update && apt install locales && \
    sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen && \
    echo 'LANG="en_US.UTF-8"'>/etc/default/locale && \
    dpkg-reconfigure --frontend=noninteractive locales && \
    update-locale LANG=en_US.UTF-8

ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ENV LC_ALL en_US.UTF-8

I tested it like this:

cat <<EOF > /tmp/test.txt
++*=|@#|¼üöäàéàè!´]]¬|¢|¢¬|{ł|¼½{}}
EOF

python3
import pathlib; pathlib.Path("/tmp/test.txt").read_text()
dom
  • https://hub.docker.com/_/ubuntu recommends doing it this way: `RUN apt-get update && apt-get install -y locales && rm -rf /var/lib/apt/lists/* && localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8` and then `ENV LANG en_US.utf8`. Alternatively it says you can probably get away with just doing `ENV LANG C.UTF-8` (see the "Locales" section) – Boris Verkhovskiy Feb 10 '23 at 20:51
0

Open with encoding UTF-16 because of lat and long:

with open(csv_name_here, 'r', encoding="utf-16") as f:
Saeed
karthik r
0

It works by just passing the argument 'rb' (read binary) instead of 'r' (read).

Jose
0

I faced this issue while unpickling data with pickle. Try:

data = pickle.load(f, encoding='latin1')
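
A slightly fuller sketch, assuming the pickle was written by Python 2 and is being read under Python 3 (the filename is illustrative):

import pickle

# encoding='latin1' tells the unpickler how to turn Python 2 str objects into text
with open('legacy_data.pkl', 'rb') as f:
    data = pickle.load(f, encoding='latin1')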
Kavya Goyal