
I am reading the Unicode HOWTO in the Python documentation. It is written that

a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF

which makes it look like the maximum number of bits needed to represent a code point is 24 (because there are 6 hexadecimal digits, and 6*4=24).

But then the documentation states:

The first encoding you might think of is using 32-bit integers as the code unit

Why is that? The first encoding I could think of would use 24-bit integers, not 32-bit ones.
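To spell out the arithmetic behind my reasoning (a quick check at the Python prompt):

```python
# The reasoning above: 0x10FFFF has 6 hex digits, each hex digit is 4 bits,
# so 24 bits would already be enough to hold any code point.
max_code_point = 0x10FFFF
hex_digits = len(f"{max_code_point:X}")   # 6
print(hex_digits * 4)                     # 24
print(max_code_point < 2 ** 24)           # True, so 24 bits suffice
```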

robertspierre

2 Answers


Actually, you only need 21 bits. Many CPUs use 32-bit registers natively, and most languages have a 32-bit integer type.

If you study the UTF-16 and UTF-8 encodings, you’ll find that their algorithms encode a maximum of a 21-bit code point using two 16-bit code units and four 8-bit code units, respectively.
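You can verify both claims from the Python prompt (a minimal check using only the built-in codecs):

```python
# 0x10FFFF, the highest code point, fits in 21 bits...
print((0x10FFFF).bit_length())            # 21

# ...and it needs two 16-bit code units in UTF-16 (a surrogate pair)
# and four 8-bit code units in UTF-8.
ch = "\U0010FFFF"
print(len(ch.encode("utf-16-le")) // 2)   # 2
print(len(ch.encode("utf-8")))            # 4
```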

Mark Tolonen

Because it is the standard way. Python uses a different internal encoding depending on the content of the string: ASCII/ISO-8859-1, UTF-16, or UTF-32. UTF-32 is a commonly used representation (usually internal to a program) for Unicode code points. So instead of reinventing yet another encoding (e.g. a "UTF-22"), Python just uses the UTF-32 representation. It also makes the different interfaces simpler. It is not as efficient in space, but much more efficient for string operations.
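You can observe that flexible internal representation (PEP 393 in CPython) indirectly through memory use; the exact byte counts are CPython implementation details, so treat this only as a sketch:

```python
import sys

# CPython stores a string with 1, 2 or 4 bytes per character, depending on
# the widest code point it contains (PEP 393); sizes are CPython-specific.
for s in ("abcd", "ab\u0100d", "ab\U00010000d"):
    widest = max(map(ord, s))
    print(f"widest code point U+{widest:06X}: {sys.getsizeof(s)} bytes")
```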

Note: in rare cases Python also uses the surrogate range to encode "wrong" bytes, so you need more than just the ordinary (non-surrogate) code points.
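The mechanism referred to here is the surrogateescape error handler, which smuggles undecodable bytes into lone surrogates so they can be restored later; a minimal sketch:

```python
# Invalid UTF-8 bytes are mapped to lone surrogates (U+DC80..U+DCFF)
# and round-trip back to the original bytes when re-encoded.
data = b"valid\xff\xfe"
text = data.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in text[-2:]])                         # ['0xdcff', '0xdcfe']
print(text.encode("utf-8", errors="surrogateescape") == data)   # True
```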

Note: colour encoding has a similar story: 8 bits * 3 channels = 24 bits, but colours are often stored in 32-bit integers (also for other reasons: a single write instead of 2 reads + 2 writes on the bus). 32 bits are simply easier and faster to handle.
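For comparison, the colour analogy in code: packing three 8-bit channels into one 32-bit word (a hypothetical 0x00RRGGBB layout, purely for illustration), with the top byte left unused much like the spare bits in UTF-32:

```python
def pack_rgb(r: int, g: int, b: int) -> int:
    # Three 8-bit channels occupy 24 bits; the top byte of the
    # 32-bit word is unused padding, traded for simpler access.
    return (r << 16) | (g << 8) | b

print(hex(pack_rgb(0x12, 0x34, 0x56)))    # 0x123456
```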

Giacomo Catenazzi