
I was reading this highly rated post on SO about Unicode.

Here is an illustration given there:

$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8
>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

and the explanation was given as:

(1) Python outputs the binary string as-is; the terminal receives it and tries to match its value against the Latin-1 character map. In Latin-1, 0xe9 (233) yields the character "é", so that's what the terminal displays.
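
You can reproduce that lookup in Python itself; a minimal sketch, assuming Python 2 as in the transcript above (Latin-1 maps every byte 0x00-0xFF straight to the codepoint with the same numeric value):

>>> import unicodedata
>>> '\xe9'.decode('latin-1')  # byte 0xe9 becomes codepoint U+00E9
u'\xe9'
>>> unicodedata.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'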

My question is: why does the terminal match against the latin-1 character map when the encoding is 'UTF-8'?

Also, when I tried:

>>> print '\xe9'
?
>>> print u'\xe9'
é

I get a different result for the first one than what is described above. Why is there this discrepancy, and where does latin-1 come into play in this picture?

  • I'm not sure how the OP managed to get that output, but that is *incorrect*, unless the OP *changed* the `sys.stdout.encoding` value. – Martijn Pieters Feb 19 '14 at 19:21

1 Answer


You are missing some important context; in that post, the OP had configured the terminal emulator (Gnome Terminal) to interpret output as Latin-1 but left the shell variables set to UTF-8. Python is thus told by the shell to use UTF-8 for Unicode output, but the terminal is actually configured to expect Latin-1 bytes.

The print output clearly shows the terminal is interpreting output using Latin-1, and is not using UTF-8.
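
You can see where Python gets that idea from; a sketch, assuming a Unix-like shell with both locales installed (the exact spelling of the reported codec name may vary by platform). Python 2 derives sys.stdout.encoding from the locale environment when stdout is a terminal, so it follows the shell variables, not the terminal emulator's own setting:

$ LANG=en_US.UTF-8 python -c 'import sys; print sys.stdout.encoding'
UTF-8
$ LANG=en_US.ISO8859-1 python -c 'import sys; print sys.stdout.encoding'
ISO8859-1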

When a terminal is set to UTF-8, the \xe9 byte is not valid (incomplete) UTF-8 and your terminal usually prints a question mark instead:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print '\xe9'
?
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('utf8')
é

If you instruct Python to replace undecodable bytes rather than raise an error, it gives you the U+FFFD REPLACEMENT CHARACTER glyph instead:

>>> '\xe9'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
>>> '\xe9'.decode('utf8', 'replace')
u'\ufffd'
>>> print '\xe9'.decode('utf8', 'replace')
�
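
For comparison, a quick sketch of the other common error handler in the same Python 2 session; 'ignore' drops the bad bytes entirely, while 'replace' substitutes U+FFFD:

>>> '\xe9'.decode('utf8', 'ignore')   # undecodable bytes are silently dropped
u''
>>> '\xe9'.decode('utf8', 'replace')  # undecodable bytes become U+FFFD
u'\ufffd'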

That's because in UTF-8, \xe9 is the start byte of a 3-byte sequence, covering the Unicode codepoints U+9000 through U+9FFF, so printed as just a single byte it is invalid. This works:

>>> print '\xe9\x80\x80'
退

because that's the UTF-8 encoding of the U+9000 codepoint, a CJK Ideograph glyph.
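
A sketch of the bit arithmetic behind that, in the same Python 2 session: the lead byte 1110xxxx contributes 4 payload bits and each 10xxxxxx continuation byte contributes 6, which for these three bytes yields exactly U+9000:

>>> (0xe9 & 0x0f) << 12 | (0x80 & 0x3f) << 6 | (0x80 & 0x3f)
36864
>>> hex(_)
'0x9000'
>>> print unichr(0x9000)
退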

If you want to understand the difference between encodings and Unicode, and how UTF-8 and other codecs work, I strongly recommend you read up on the subject.

  • so why do `'\xe9'` and `u'\xe9'` print differently, and what is the difference between `latin-1` and `unicode`, since one of them uses one byte and the other two? (my vague understanding from that post) – eagertoLearn Feb 19 '14 at 19:24
  • I was under the impression that `?` is the Unicode representation, but not for invalid `UTF-8`; that brings up the question of how `?` is represented in `UTF-8`. On the same note, `print chr(0xFF)` prints `?` – brain storm Feb 19 '14 at 19:30
  • @user1988876: `?` is just `U+003F`, or `\x3f` as a UTF-8, Latin1 or ASCII byte. – Martijn Pieters Feb 19 '14 at 19:32
  • @user1988876: *any* single byte outside the range 00-7F is invalid in UTF-8. After U+007F all codepoints require at least 2 bytes to encode. – Martijn Pieters Feb 19 '14 at 19:33
  • @MartijnPieters: Thanks for pointing that out. So would that mean it is valid in `UTF-16`? – brain storm Feb 19 '14 at 19:34
  • @user1988876: no, UTF-16 always uses multiples of *two* bytes. For most of the Unicode standard, one such pair is enough, beyond U+FFFF two pairs are used. – Martijn Pieters Feb 19 '14 at 19:37
  • `\xe9` - how do you know that this is the start byte of a 3-byte encoding? Any table lookup, or some calculation? Sorry if this sounds stupid – eagertoLearn Feb 19 '14 at 19:41
  • @eagertoLearn: see the [Wikipedia article on UTF-8](https://en.wikipedia.org/wiki/UTF-8); UTF-8 is a variable-byte encoding. – Martijn Pieters Feb 19 '14 at 19:43
  • @eagertoLearn: If you look at the binary representation of hex E9, you'll see it starts with 3 bits set, one bit not set (`1110 1001`); this indicates it's the starting byte of a 3-byte character. A single-byte sequence starts with `0` (values 00-7F), a two-byte sequence starts with `110` (values C0-DF), 3 bytes with `1110` (E0-EF), etc. – Martijn Pieters Feb 19 '14 at 19:48
  • @MartijnPieters: I understand `E9` is `1110 1001`, but I do not see how it starts with 3 bits set and one bit not set. – eagertoLearn Feb 19 '14 at 20:06
  • @eagertoLearn: `1110` is three set bits (each 1), and a 0 (not set). – Martijn Pieters Feb 19 '14 at 20:18
  • @MartijnPieters: huh! I see what you mean, but how does that indicate it's the starting byte of a 3-byte character? This is the confusing part; the rest of your explanation is very clear. Thanks again – eagertoLearn Feb 19 '14 at 20:34
  • That's how the standard was designed; it makes the work of decoders easier. – Martijn Pieters Feb 19 '14 at 20:56
  • @MartijnPieters: can you give me an example of a 3-byte character? I added `F` to make it three bytes (`chr(0xE9F)`) but I get a `ValueError` – eagertoLearn Feb 19 '14 at 22:02
  • @eagertoLearn: There is such a character in the answer. You want a *unicode* codepoint (`unichr()` perhaps), or to produce *3 bytes*; `chr()` can only produce **one** byte. `chr(0xE9) + chr(0x80) + chr(0x80)` for example. – Martijn Pieters Feb 19 '14 at 22:32
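
Tying those comments together, a minimal sketch in the same Python 2 session (it assumes nothing beyond the standard library):

>>> bin(0xe9)                      # three leading 1 bits, then a 0: lead byte of a 3-byte sequence
'0b11101001'
>>> unichr(0x9000)                 # a single Unicode *codepoint*...
u'\u9000'
>>> unichr(0x9000).encode('utf8')  # ...whose UTF-8 encoding is three *bytes*
'\xe9\x80\x80'
>>> chr(0xe9) + chr(0x80) + chr(0x80) == unichr(0x9000).encode('utf8')
True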