Python 2.7 UnicodeDecodeError: 'ascii' codec can't decode byte

Question

I've been parsing some docx files (UTF-8 encoded XML) with special characters (Czech alphabet). When I try to output to stdout, everything goes smoothly, but I'm unable to output data to a file:

Traceback (most recent call last):
File "./test.py", line 360, in
ofile.write(u'\t\t\t\t\t\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)

Although I explicitly cast the word variable to unicode (type(word) returned unicode) and tried to encode it with .encode('utf-8'), I'm still stuck with this error.

Here is a sample of the code as it looks now:

for word in word_list:
    word = unicode(word)
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...

I also tried the following:

for word in word_list:
    word = word.encode('utf-8')
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...

Even the combination of these two:

word = unicode(word)
word = word.encode('utf-8')

I was getting desperate, so I even tried to encode the word variable inside ofile.write():

ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')

I would appreciate any hints of what I'm doing wrong.

I bet you wouldn't have these problems if you were using the latest version of Python. — Oleh Prypin, Nov 22 '12 at 12:10
Unfortunately I got the same error with a more suitable encoding, and I can't use the latest Python version because the server where the script will run has v2.7. — gilipf, Nov 22 '12 at 12:16
[This answer](http://stackoverflow.com/a/844443/1258041) may help. — Lev Levitsky, Nov 22 '12 at 12:16

phihag · Accepted Answer · 2012-11-22T12:20:14.890

11

ofile is a byte stream, but you are writing a character (unicode) string to it. Python 2 therefore tries to handle the mismatch by implicitly encoding the string to a byte string with the default ASCII codec. That only works for ASCII characters. Since word contains non-ASCII characters, it fails:

>>> open('/dev/null', 'wb').write(u'ä')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0:
                    ordinal not in range(128)
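
The implicit conversion is roughly what you would get by calling .encode('ascii') yourself. As a minimal sketch (the variable names are only illustrative, and it runs on both Python 2.7 and 3), the same failure can be reproduced explicitly and compared with an explicit UTF-8 encode:

```python
# Sketch: reproduce the implicit ASCII encode explicitly.
text = u'\xe4'  # a-umlaut (U+00E4) as a text (unicode) string

error = None
try:
    text.encode('ascii')           # ASCII has no byte for U+00E4
except UnicodeEncodeError as exc:
    error = exc                    # same error type as in the traceback above

utf8_bytes = text.encode('utf-8')  # UTF-8 can represent it, using two bytes
```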

Make ofile a text stream by opening the file with io.open, with a mode like 'wt', and an explicit encoding:

>>> import io
>>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
1L

Alternatively, you can also use codecs.open with pretty much the same interface, or encode all strings manually with encode.
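
Applied to the loop from the question, the io.open approach might look like the sketch below. The word list and the output path are made-up stand-ins, not the asker's real data; the write call is the one from the question.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the words parsed from the .docx file.
word_list = [u'p\u0159\xedli\u0161', u'\u017elu\u0165ou\u010dk\xfd']

# Hypothetical output location; a temp directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), 'out.xml')

# Text mode plus an explicit encoding: write() accepts unicode strings
# and the stream encodes them to UTF-8 bytes on the way out.
with io.open(out_path, 'wt', encoding='utf-8') as ofile:
    for word in word_list:
        ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n')

# Reading back through a text stream yields unicode strings again.
with io.open(out_path, 'rt', encoding='utf-8') as ifile:
    lines = ifile.read().splitlines()
```

The only thing to watch is that everything passed to write() is a unicode object; a byte string slipped into the concatenation would bring the implicit ASCII conversion back on Python 2.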

edited Nov 22 '12 at 12:20

answered Nov 22 '12 at 12:13

phihag


Is it important, when outputting with `io`, to read the input data the same way? – gilipf Nov 22 '12 at 12:28
@rivfaader No, not at all. Just make sure the data consists of only `unicode` objects. It may help to run your code in Python 3.3+, because it won't silently let `bytes` objects pass. – phihag Nov 22 '12 at 12:33
Well thank you very much, I asked because I got the same error even if I opened it the right way... But I encoded all variables and it works, thanks again – gilipf Nov 22 '12 at 12:39
Hi @phihag, I'm using the IOByte to upload a file to my server. For some files I get the same error: `UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 1: ordinal not in range(128)` How can I solve this? How can I apply the same approach you mentioned here to an IOByte object? – Rafael Soares - tuelho May 21 '15 at 13:16
@Tuelho Please ask a new question. Don't forget to post a [reproducible example](http://sscce.org/). If you want, you can [mail me](mailto:phihag@phihag.de) or comment here with a link to the new question. – phihag May 21 '15 at 16:23

score 2 · Answer 2 · answered Nov 22 '12 at 12:32


Phihag's answer is correct. I'd just like to suggest converting the unicode to a byte string manually with an explicit encoding:

ofile.write((u'\t\t\t\t\t<feat att="writtenForm" val="' +
             word + u'"/>\n').encode('utf-8'))

(Maybe you'd like to know how it's done using basic mechanisms instead of advanced wizardry and black magic like io.open.)
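
As a self-contained sketch of this manual approach: build the whole line as unicode, then encode it once just before it reaches the binary file. The helper name feat_line and the output path are hypothetical, introduced only for illustration.

```python
import os
import tempfile


def feat_line(word):
    """Build one output line as text, then encode it once at the boundary."""
    line = u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n'
    return line.encode('utf-8')


out_path = os.path.join(tempfile.mkdtemp(), 'out.xml')  # hypothetical path

# Binary mode: the file only ever sees byte strings.
with open(out_path, 'wb') as ofile:
    ofile.write(feat_line(u'p\u0159\xedli\u0161'))

with open(out_path, 'rb') as ifile:
    raw = ifile.read()
```

Encoding the finished line, rather than individual pieces, avoids mixing unicode and byte strings in one concatenation, which is what triggered the error in the question's later attempts.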


Alfe


Umm, why is `io.open` advanced wizardry or even [black magic](http://catb.org/jargon/html/B/black-magic.html)? I'm pretty sure it's not that hard to understand the difference between a bytestream and a textstream, given that one has a [minimal mental model of the difference](http://www.joelonsoftware.com/articles/Unicode.html), which I'd expect from every programmer. – phihag Nov 22 '12 at 12:35
Every science sufficiently advanced is indistinguishable from magic. (Arthur C. Clarke) What I meant is that your answer solves a special case (streaming) by making use of a lib, and you do not need to understand how it does this (i.e. magic). In fact, you run into this kind of trouble in more cases than just with streams, and it's always good to know how to solve it in general. Lots of places automatically convert a given unicode into a str and raise errors as soon as there are strange characters inside. – Alfe Nov 22 '12 at 14:12
In Python 3, there are virtually no autoconverts. Note that `io.open` is only in the stdlib (and not builtins) in Python 2. In Python 3 `io.open is open`. – phihag Nov 22 '12 at 14:13

score 2 · Answer 3 · answered Nov 30 '14 at 20:49

I've had a similar error when writing to Word documents (.docx), specifically with the Euro symbol (€).

x = "€".encode()

Which gave the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

I solved it by decoding the byte string instead, with an explicit encoding:

x = "€".decode('utf-8')

I hope this helps!
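
The byte 0xe2 in that error message is the first of the three UTF-8 bytes of the Euro sign. A small sketch (variable names are illustrative; it runs on Python 2.7 and 3) shows the bytes-to-text direction with the codec named explicitly, and that the ASCII codec cannot decode those bytes:

```python
euro_bytes = b'\xe2\x82\xac'            # UTF-8 bytes of the Euro sign
euro_text = euro_bytes.decode('utf-8')  # bytes -> text, codec named explicitly
round_trip = euro_text.encode('utf-8')  # text -> bytes again

ascii_failed = False
try:
    euro_bytes.decode('ascii')          # 0xe2 is outside the ASCII range
except UnicodeDecodeError:
    ascii_failed = True
```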

score 1 · Answer 4 · edited May 23 '17 at 12:01


The best solution I found on Stack Overflow is in this post: How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte". Put the following at the beginning of the code and the default encoding will be UTF-8:

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')


answered Nov 14 '16 at 12:59

Jose R. Zapata

