I am trying to write a "string" to a file and get the following error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 6: ordinal not in range(128)

I tried the following methods:

print >>f, txt
print >>f, txt.decode('utf-8')
print >>f, txt.encode('utf-8')

None of them work; I get the same error message each time.

What is the idea behind encoding and decoding? If I have a unicode object, can I write it to the file directly, or do I need to transform it to a string first?

How can I find out which encoding is used? How can I know whether it is utf-8, ascii, or something else?

ADDED

I think I have just managed to save a string to a file. `print >>f, txt` and `print >>f, txt.decode('utf-8')` did not work, but `print >>f, txt.encode('utf-8')` does. I get no error message and I see Chinese characters in my file.

Roman
  • And what's that string? – EbraHim Apr 25 '16 at 08:04
  • @EbraHim, I guess that it is a unicode object because I obtained the strings by reading them in the following way: `for line in io.open(fname, encoding="utf8"):` – Roman Apr 25 '16 at 08:06
  • @Roman `for line in io.open(fname, encoding="utf8"):` change the encoding to utf-8 – Mani Apr 25 '16 at 08:08
  • your question is answered here: http://stackoverflow.com/questions/6048085/python-write-unicode-text-to-a-text-file – lesingerouge Apr 25 '16 at 08:11
  • Files contain bytes. Unicode strings are made up of code points. You need to translate those into bytes; there are many ways to do that, and the translation is called encoding. – RemcoGerlich Apr 25 '16 at 08:11

2 Answers

I recently posted another answer that addresses this very issue. Key quote:

For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
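The round trip described in the quote can be sketched in a few lines. This is only an illustration: it uses the character `u'\xcd'` from the error message in the question and assumes UTF-8 as the codec. It is written so that it should run on both Python 2 and Python 3.

```python
# One character string holding a single character, U+00CD.
text = u'\xcd'

# Encoding: characters -> bytes.
data = text.encode('utf-8')

# Decoding: bytes -> characters.
restored = data.decode('utf-8')

print(len(text))         # number of characters
print(len(data))         # number of bytes after encoding
print(restored == text)  # the round trip gives back the original text
```

The character count and the byte count differ because UTF-8 uses more than one byte for characters outside the ASCII range.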

In Python 2, unicode objects are character strings. Regular str objects can be either character strings or byte strings. (Pro tip: use Python 3, it makes keeping track a lot easier.)

You should be passing character strings (not byte strings) to print, but you will need to be sure that those character strings can be encoded by the codec (such as ASCII or UTF-8) associated with the destination file object f. As part of the output process, Python encodes the string for you. If the string contains characters that cannot be encoded by the file object's codec, you will get errors like the one you're seeing.
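As a rough sketch of that behaviour, the snippet below writes the same character string to a file twice, once with an ASCII codec and once with UTF-8. The helper `try_write` and the sample string are made up for illustration, and `io.open` is used instead of `print >>f` so the codec attached to the file object is explicit.

```python
import io
import os
import tempfile


def try_write(text, encoding):
    """Write text to a temporary file using the given codec.

    Returns True if the write succeeded and False if the codec
    could not encode the text. (Hypothetical helper for illustration.)
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with io.open(path, 'w', encoding=encoding) as f:
            f.write(text)
        return True
    except UnicodeEncodeError:
        return False
    finally:
        os.remove(path)


sample = u'abc\xcd'  # hypothetical sample containing a non-ASCII character

ascii_ok = try_write(sample, 'ascii')  # the ASCII codec cannot encode u'\xcd'
utf8_ok = try_write(sample, 'utf-8')   # UTF-8 can encode any character

print(ascii_ok)
print(utf8_ok)
```

The string is the same in both calls; only the codec associated with the file object decides whether the write fails.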

Without knowing what is in your txt object I can't be more specific.

David Z

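I think you need to use the codecs library:

import codecs

# Open the file with an explicit UTF-8 codec; unicode text is encoded on write.
# The with block closes the file automatically.
with codecs.open("test.txt", "w", "utf-8") as f:
    f.write(u'\xcd')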

Works fine.
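One way to check what actually ended up on disk is to read the file back, once as raw bytes and once decoded as UTF-8. The sketch below assumes a temporary directory rather than the current working directory, so the path is not the one used in the answer.

```python
import codecs
import io
import os
import tempfile

# Write the character through a UTF-8 codec, as in the answer above.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with codecs.open(path, 'w', 'utf-8') as f:
    f.write(u'\xcd')

# Read the raw bytes that were stored in the file.
with io.open(path, 'rb') as f:
    raw = f.read()

# Read the same file again, decoding the bytes back into characters.
with io.open(path, encoding='utf-8') as f:
    decoded = f.read()

print(len(raw))               # bytes stored on disk
print(decoded == u'\xcd')     # decoding recovers the original character
```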

The Story of Encoding/Decoding:

In the early days, computers supported only a small character set. ASCII defines 128 characters (upper-case and lower-case letters, digits, punctuation, and control codes), so a single byte was enough to give each one a unique number. Assigning numbers to characters so that they can be stored as bytes is called encoding. ASCII is the one-byte encoding that Python 2 uses by default.

As computers spread around the world, more letters and characters were needed, and 1 byte was no longer enough. Different encoding schemes appeared. Unicode is the best-known standard: it assigns a number (a code point) to every character, and encodings such as UTF-8 turn those code points into bytes. The character you are trying to store in your file, u'\xcd', is outside the ASCII range and needs 2 bytes in UTF-8, so you must explicitly tell Python not to use the default encoding, i.e. ASCII.
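To make the byte counts concrete, here is a small sketch that encodes three sample characters and counts the resulting bytes. It assumes UTF-8 as the encoding; the sample characters (an ASCII letter, the character from the error message, and one Chinese character) are chosen only for illustration.

```python
# Sample characters: ASCII 'A', U+00CD from the error message, and U+4E2D.
samples = [u'A', u'\xcd', u'\u4e2d']

# Number of bytes each character occupies once encoded as UTF-8.
sizes = [len(ch.encode('utf-8')) for ch in samples]

for ch, size in zip(samples, sizes):
    print('%r takes %d byte(s) in UTF-8' % (ch, size))
```

Other encodings give different byte counts for the same characters, which is why the encoding has to be stated when text is written to a file.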

EbraHim