4

I've been parsing some docx files (UTF-8 encoded XML) with special characters (Czech alphabet). When I output to stdout, everything goes smoothly, but writing the data to a file fails:

Traceback (most recent call last):
  File "./test.py", line 360, in <module>
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)

Although I explicitly cast the word variable to the unicode type (type(word) returned unicode) and also tried encoding it with .encode('utf-8'), I'm still stuck with this error.

Here is a sample of the code as it looks now:

for word in word_list:
    word = unicode(word)
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...

I also tried the following:

for word in word_list:
    word = word.encode('utf-8')
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...

Even the combination of these two:

word = unicode(word)
word = word.encode('utf-8')

I was getting desperate, so I even tried encoding the word variable inside the ofile.write() call:

ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')

I would appreciate any hints of what I'm doing wrong.

dda
gilipf

4 Answers

11

ofile is a byte stream, but you are writing a character (unicode) string to it. Python 2 tries to compensate by implicitly encoding the string to a byte string with the default ASCII codec, which only works for ASCII characters. Since word contains non-ASCII characters, it fails:

>>> open('/dev/null', 'wb').write(u'ä')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0:
                    ordinal not in range(128)

Make ofile a text stream by opening the file with io.open, with a mode like 'wt', and an explicit encoding:

>>> import io
>>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
1L

Alternatively, you can also use codecs.open with pretty much the same interface, or encode all strings manually with encode.
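Applied to the loop from the question, the io.open approach could look roughly like the sketch below. The sample word_list and the feats.xml output path are made up for illustration, and the code assumes every item in word_list is already a unicode object.

```python
import io
import os
import tempfile

# Hypothetical sample data; in the question these come from the parsed docx.
word_list = [u'p\u0159\xedklad', u'\u010de\u0161tina']

# Hypothetical output path, placed in a temp directory for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), 'feats.xml')

# Text stream: accepts unicode objects and encodes them as UTF-8 on write.
with io.open(out_path, 'wt', encoding='utf-8') as ofile:
    for word in word_list:
        ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n')

# Reading back through a text stream decodes the bytes to unicode again.
with io.open(out_path, 'rt', encoding='utf-8') as ifile:
    lines = ifile.read().splitlines()
```

Because the stream does the encoding, no .encode() call is needed anywhere in the loop.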

phihag
  • is it important for outputing with `io` to read the data from input with the same way? – gilipf Nov 22 '12 at 12:28
  • @rivfaader No, not at all. Just make sure the data consists of only `unicode` objects. It may help to run your code in Python 3.3+, because it won't silently let `bytes` objects pass. – phihag Nov 22 '12 at 12:33
  • Well thank you very much, I asked because I got the same error even if I opened it the right way... But I encoded all variables and it works, thanks again – gilipf Nov 22 '12 at 12:39
  • Hi @phihag, I'm using the IOByte to upload a file to my server. For some files I get the same error: `UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 1: ordinal not in range(128)` How can I solve this? How can I apply the same approach you mentioned here to an IOByte object? – Rafael Soares - tuelho May 21 '15 at 13:16
  • @Tuelho Please ask a new question. Don't forget to post a [reproducible example](http://sscce.org/). If you want, you can [mail me](mailto:phihag@phihag.de) or comment here with a link to the new question. – phihag May 21 '15 at 16:23
2

Phihag's answer is correct. I just want to suggest converting the unicode to a byte string manually, with an explicit encoding:

ofile.write((u'\t\t\t\t\t<feat att="writtenForm" val="' +
             word + u'"/>\n').encode('utf-8'))

(Maybe you like to know how it's done using basic mechanisms instead of advanced wizardry and black magic like io.open.)
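A self-contained sketch of this manual approach, assuming the file is opened in binary mode; the sample word and the output path are hypothetical:

```python
import os
import tempfile

word = u'p\u0159\xedklad'  # hypothetical sample word with non-ASCII characters

# Build the whole line as unicode first, then encode it exactly once.
line = u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n'
encoded = line.encode('utf-8')

out_path = os.path.join(tempfile.mkdtemp(), 'feats.xml')  # hypothetical path
with open(out_path, 'wb') as ofile:  # byte stream: takes byte strings only
    ofile.write(encoded)

# Read the raw bytes back to check what actually landed in the file.
with open(out_path, 'rb') as ifile:
    raw = ifile.read()
```

Encoding the finished line, rather than just word, avoids concatenating unicode and byte strings, which is what triggered the implicit ASCII conversion in the question's later attempts.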

Alfe
  • Umm, why is `io.open` advanced wizardry or even [black magic](http://catb.org/jargon/html/B/black-magic.html)? I'm pretty sure it's not that hard to understand the difference between a bytestream and a textstream, given that one has a [minimal mental model of the difference](http://www.joelonsoftware.com/articles/Unicode.html), which I'd expect from every programmer. – phihag Nov 22 '12 at 12:35
  • Every science sufficiently advanced is indistinguishable from magic. (Arthur C. Clarke) What I meant is that your answer solves a special case (streaming) by making use of a lib and you do not need to understand how it is doing this (i.e. magic). In fact you run into this kind of trouble in more cases than just with streams and it's always good to know how to solve it in general. Lots of places automatically convert a given unicode into a str and raise errors as soon as there are strange characters inside. – Alfe Nov 22 '12 at 14:12
  • In Python 3, there are virtually no autoconverts. Note that `io.open` is only in the stdlib (and not builtins) in Python 2. In Python 3 `io.open is open`. – phihag Nov 22 '12 at 14:13
2

I've had a similar error when writing to Word documents (.docx), specifically with the Euro symbol (€).

x = "€".encode()

Which gave the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

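I solved it by decoding instead, with an explicit encoding (in Python 2 a bare .decode() falls back to the ASCII codec and raises the same error):

x = "€".decode('utf-8')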
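The two directions are easy to mix up, so here is a minimal sketch of both, assuming the source bytes are UTF-8 (which the 0xe2 byte in the error message suggests). The variable names are only illustrative.

```python
# The UTF-8 bytes of the Euro sign, as they appear in a UTF-8 source file.
euro_bytes = b'\xe2\x82\xac'

# decode: byte string -> text, with the encoding stated explicitly.
euro_text = euro_bytes.decode('utf-8')

# encode: text -> byte string, the opposite direction.
round_trip = euro_text.encode('utf-8')
```

Calling .encode() on a byte string in Python 2 first decodes it implicitly with ASCII, which is why the original attempt reported a decode error.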

I hope this helps!

John Paul Hayes
1

The best solution I found on Stack Overflow is in this post: How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte". Put the following at the beginning of the code and the default encoding will be UTF-8:

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Jose R. Zapata