noob queries on unicode and str methods in Python

Question

I would like some clarification on Unicode and str methods in Python. After reading some explanations of Unicode, I still have a couple of doubts that I hope folks can help me with:

Am I right to say that when declaring a unicode string, e.g. word = u'foo', Python uses the encoding of the terminal (e.g. UTF-8) to decode foo, and assigns word the hex representation in Unicode?
So, in general, does printing the characters in a file always involve decoding the byte stream to its Unicode representation according to the file's encoding, before displaying the mapped characters?
In my terminal, why do 'é'.lower() and str('é') display as the hex '\xc3\xa9', whereas 'a'.lower() does not?

score 2 · Accepted Answer · answered Dec 08 '11 at 20:06

First, we should be clear that we are talking about Python 2 only. Python 3 is different.

You're right. But if you write u"abcd" in a .py file, the encoding declared for the source file determines how the interpreter decodes your string.
You need to decode it first, and then encode it and print. In Python 2, DON'T print unicode directly! Otherwise, if the system encodes it in an incompatible way (like "ascii"), an exception will be raised. You have to do all of this explicitly.
The short answer is that "a" doesn't have to be represented as "\x61"; "a" is simply more readable. A longer answer: in the interactive shell, if you type a value and press Enter, Python shows the repr() of your string. For a str, repr() produces an ASCII-only representation. "a" is already printable ASCII, so it is output directly. The str "é" is a UTF-8 encoded byte sequence, so Python escapes each byte and prints '\xc3\xa9'.
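To make the repr behaviour concrete, here is a small sketch. It is written for Python 3 (an assumption on my part, so it can be run today), where bytes plays the role of the Python 2 str. The variable names are just illustrative.

```python
# Python 3 sketch: bytes stands in for the Python 2 str discussed above.
raw_a = "a".encode("utf-8")        # a single printable ASCII byte
raw_e = "\u00e9".encode("utf-8")   # é, which UTF-8 encodes as two bytes

repr_a = repr(raw_a)   # printable ASCII is shown as-is
repr_e = repr(raw_e)   # non-ASCII bytes are shown as \x escapes

print(repr_a)  # b'a'
print(repr_e)  # b'\xc3\xa9'

# lower() on a byte string only changes ASCII letters,
# so the two bytes of é come back unchanged.
lowered = raw_e.lower()
print(lowered == raw_e)  # True
```

So nothing special happens to é in lower(); the hex form is only how repr displays bytes that are not printable ASCII.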

score 0 · Answer 2 · answered Dec 08 '11 at 19:44

I don't think Python does any automatic encoding or decoding on console I/O. Consider the following:

>>> 'é'
'\xc3\xa9'
>>> 'é'.decode('UTF-8')
u'\xe9'

You'll notice that \xe9 is the Unicode code point for 'LATIN SMALL LETTER E WITH ACUTE', while \xc3\xa9 is the byte sequence corresponding to the same character in UTF-8.
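The difference between the code point and its encoded bytes can be checked with a short sketch like this one. It assumes Python 3 rather than the Python 2 session above, and the latin-1 comparison is my own addition for contrast.

```python
import unicodedata

ch = "\u00e9"                          # é, written as an escape
code_point = ord(ch)                   # the Unicode code point, 0xE9
name = unicodedata.name(ch)            # the official character name
utf8_bytes = ch.encode("utf-8")        # two bytes in UTF-8
latin1_bytes = ch.encode("latin-1")    # one byte in Latin-1

print(hex(code_point), name)
print(len(ch), len(utf8_bytes), len(latin1_bytes))  # 1 2 1
```

One character, one code point, but a different number of bytes depending on the encoding chosen.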

Everything changes in Python 3, since all strings are Unicode. I'm not sure of the rules there.

Mark Ransom

re “automatic encoding/decoding”: check `sys.stdin.encoding` and `sys.stdout.encoding`. – tzot Dec 25 '11 at 22:38
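A quick way to look at those attributes, as a rough sketch: the helper name stream_encodings is hypothetical, and the values depend entirely on your terminal and environment (they can be None when a stream is redirected or missing).

```python
import sys

def stream_encodings():
    """Return the encoding reported by each standard stream, or None."""
    result = {}
    for stream_name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, stream_name, None)
        result[stream_name] = getattr(stream, "encoding", None)
    return result

encodings = stream_encodings()
for stream_name, encoding in sorted(encodings.items()):
    print(stream_name, encoding)
```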

score 0 · Answer 3 · answered Dec 08 '11 at 19:51

See http://www.python.org/dev/peps/pep-0263/ for how to specify the encoding of a Python source file. For the Python interpreter, there's the PYTHONIOENCODING environment variable.
What OS do you use?
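As a rough illustration of PYTHONIOENCODING, the sketch below starts a child interpreter with the variable set and asks it what encoding its stdout uses. It assumes Python 3 and that spawning a subprocess is allowed. The helper name child_stdout_encoding is made up for this example.

```python
import codecs
import os
import subprocess
import sys

def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and report its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return out.decode("ascii").strip()

reported = child_stdout_encoding("latin-1")
# Codec names have aliases, so normalise before comparing.
normalized = codecs.lookup(reported).name
print(reported, normalized)
```

The source file encoding from PEP 263 is a separate setting: it is a comment such as `# -*- coding: utf-8 -*-` on the first or second line of the .py file.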

pronvit

score 0 · Answer 4 · answered Dec 09 '11 at 02:27

The statement word = u'foo' assigns a unicode string object, not a "hex representation". Unicode objects represent sequences of text characters. Also, it is wrong to think of decoding in this context. Unicode is not an encoding, nor does it "have" an encoding.
Yes. Decode In: Encode Out.
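For a non-unicode string literal typed at the prompt, the bytes are whatever the terminal sent, encoded in sys.stdin.encoding, and repr escapes the non-ASCII ones; for the repr of a unicode string literal, Python uses escapes in the style of "unicode_escape".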
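"Decode in, encode out" might look like the following sketch. It assumes Python 3 and UTF-8 on both sides, and uses in-memory io.BytesIO buffers as stand-ins for a real file or socket.

```python
import io

# Stand-ins for an input source and an output sink that carry raw bytes.
incoming = io.BytesIO("caf\u00e9\n".encode("utf-8"))
outgoing = io.BytesIO()

text = incoming.read().decode("utf-8")    # decode in: bytes -> text
shouted = text.upper()                    # work on text only
outgoing.write(shouted.encode("utf-8"))   # encode out: text -> bytes

# The "unicode_escape" codec gives an ASCII-only form similar to what repr shows.
escaped = text.strip().encode("unicode_escape")
print(escaped)  # b'caf\\xe9'
```

Keeping text as text in the middle means methods like upper() see characters rather than individual bytes.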
