
I was reading this highly rated post on SO about Unicode.

Here is an illustration given there:

$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8
>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

and the explanation was given as:

(1) Python outputs the binary string as-is; the terminal receives it and tries to match its value against the Latin-1 character map. In Latin-1, 0xe9 (233) yields the character "é", so that's what the terminal displays.
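
You can reproduce that lookup in Python itself; a minimal sketch, assuming Python 2 as in the transcript above (Latin-1 maps every byte 0x00-0xFF straight to the codepoint with the same numeric value):

>>> import unicodedata
>>> '\xe9'.decode('latin-1')  # byte 0xe9 becomes codepoint U+00E9
u'\xe9'
>>> unicodedata.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'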

My question is: why does the terminal match against the latin-1 character map when the encoding is 'UTF-8'?

Also, when I tried:

>>> print '\xe9'
?
>>> print u'\xe9'
é

I get a different result for the first one than what is described above. Why is there this discrepancy, and where does latin-1 come into play in this picture?

  • I'm not sure how the OP managed to get that output, but that is *incorrect*, unless the OP *changed* the `sys.stdout.encoding` value. – Martijn Pieters Feb 19 '14 at 19:21

1 Answer


You are missing some important context; in that post, the OP had configured the terminal emulator (Gnome Terminal) to interpret output as Latin-1 but left the shell variables set to UTF-8. Python is thus told by the shell to use UTF-8 for Unicode output, but the terminal is actually configured to expect Latin-1 bytes.

The print output clearly shows the terminal is interpreting output using Latin-1, and is not using UTF-8.
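
You can see where Python gets that idea from; a sketch, assuming a Unix-like shell with both locales installed (the exact spelling of the reported codec name may vary by platform). Python 2 derives sys.stdout.encoding from the locale environment when stdout is a terminal, so it follows the shell variables, not the terminal emulator's own setting:

$ LANG=en_US.UTF-8 python -c 'import sys; print sys.stdout.encoding'
UTF-8
$ LANG=en_US.ISO8859-1 python -c 'import sys; print sys.stdout.encoding'
ISO8859-1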

When a terminal is set to UTF-8, the \xe9 byte is not valid (incomplete) UTF-8 and your terminal usually prints a question mark instead:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print '\xe9'
?
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('utf8')
é

If you instruct Python to replace undecodable bytes rather than raise an error, it gives you the U+FFFD REPLACEMENT CHARACTER glyph instead:

>>> '\xe9'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
>>> '\xe9'.decode('utf8', 'replace')
u'\ufffd'
>>> print '\xe9'.decode('utf8', 'replace')
�
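
For comparison, a quick sketch of the other common error handler in the same Python 2 session; 'ignore' drops the bad bytes entirely, while 'replace' substitutes U+FFFD:

>>> '\xe9'.decode('utf8', 'ignore')   # undecodable bytes are silently dropped
u''
>>> '\xe9'.decode('utf8', 'replace')  # undecodable bytes become U+FFFD
u'\ufffd'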

That's because in UTF-8, \xe9 is the start byte of a 3-byte sequence, covering the Unicode codepoints U+9000 through U+9FFF, so printed as just a single byte it is invalid. This works:

>>> print '\xe9\x80\x80'
退

because that's the UTF-8 encoding of the U+9000 codepoint, a CJK Ideograph glyph.
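
A sketch of the bit arithmetic behind that, in the same Python 2 session: the lead byte 1110xxxx contributes 4 payload bits and each 10xxxxxx continuation byte contributes 6, which for these three bytes yields exactly U+9000:

>>> (0xe9 & 0x0f) << 12 | (0x80 & 0x3f) << 6 | (0x80 & 0x3f)
36864
>>> hex(_)
'0x9000'
>>> print unichr(0x9000)
退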

If you want to understand the difference between encodings and Unicode, and how UTF-8 and other codecs work, I strongly recommend you read up on the subject.

  • so why do `'\xe9'` and `u'\xe9'` print differently, and what is the difference between `latin-1` and `unicode`, since one of them uses one byte and the other two? (my vague understanding from that post) – eagertoLearn Feb 19 '14 at 19:24
  • I was under the impression that `?` is the Unicode representation, but not for invalid `UTF-8`; that brings up the question of how `?` is represented in `UTF-8`. On the same note, `print chr(0xFF)` prints `?` – brain storm Feb 19 '14 at 19:30
  • @user1988876: `?` is just `U+003F`, or `\x3f` as a UTF-8, Latin1 or ASCII byte. – Martijn Pieters Feb 19 '14 at 19:32
  • @user1988876: *any* single byte outside the range 00-7F is invalid in UTF-8. After U+007F all codepoints require at least 2 bytes to encode. – Martijn Pieters Feb 19 '14 at 19:33
  • @MartijnPieters: Thanks for pointing that out. So would that mean it is valid in `UTF-16`? – brain storm Feb 19 '14 at 19:34
  • @user1988876: no, UTF-16 always uses multiples of *two* bytes. For most of the Unicode standard, one such pair is enough, beyond U+FFFF two pairs are used. – Martijn Pieters Feb 19 '14 at 19:37
  • `\xe9` - how do you know that this is the start byte of a 3-byte encoding? Any table lookup, or some calculation? Sorry if this sounds stupid – eagertoLearn Feb 19 '14 at 19:41
  • @eagertoLearn: see the [Wikipedia article on UTF-8](https://en.wikipedia.org/wiki/UTF-8); UTF-8 is a variable-byte encoding. – Martijn Pieters Feb 19 '14 at 19:43
  • @eagertoLearn: If you look at the binary representation of hex E9, you'll see it starts with 3 bits set, one bit not set (`1110 1001`); this indicates it's the starting byte of a 3-byte character. A single-byte sequence starts with `0` (values 00-7F), a two-byte sequence starts with `110` (values C0-DF), 3 bytes with `1110` (E0-EF), etc. – Martijn Pieters Feb 19 '14 at 19:48
  • @MartijnPieters: I understand `E9` is `1110 1001`, but I do not see how it starts with 3 bits set and one bit not set. – eagertoLearn Feb 19 '14 at 20:06
  • @eagertoLearn: `1110` is three set bits (each 1), and a 0 (not set). – Martijn Pieters Feb 19 '14 at 20:18
  • @MartijnPieters: huh! I see what you mean, but how does that indicate it's the starting byte of a 3-byte character? This is the confusing part; the rest of your explanation is very clear. Thanks again – eagertoLearn Feb 19 '14 at 20:34
  • That's how the standard was designed; it makes the work of decoders easier. – Martijn Pieters Feb 19 '14 at 20:56
  • @MartijnPieters: can you give me an example of a 3-byte character? I added `F` to make it three bytes (`chr(0xE9F)`) but I get a `ValueError` – eagertoLearn Feb 19 '14 at 22:02
  • @eagertoLearn: There is such a character in the answer. You want a *unicode* codepoint (`unichr()` perhaps), or to produce *3 bytes*; `chr()` can only produce **one** byte. `chr(0xE9) + chr(0x80) + chr(0x80)` for example. – Martijn Pieters Feb 19 '14 at 22:32
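
Tying those comments together, a minimal sketch in the same Python 2 session (it assumes nothing beyond the standard library):

>>> bin(0xe9)                      # three leading 1 bits, then a 0: lead byte of a 3-byte sequence
'0b11101001'
>>> unichr(0x9000)                 # a single Unicode *codepoint*...
u'\u9000'
>>> unichr(0x9000).encode('utf8')  # ...whose UTF-8 encoding is three *bytes*
'\xe9\x80\x80'
>>> chr(0xe9) + chr(0x80) + chr(0x80) == unichr(0x9000).encode('utf8')
True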