
I am reading the Unicode HOWTO in the Python documentation. It is written that

a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF

which makes it look like the maximum number of bits needed to represent a code point is 24 (because there are 6 hexadecimal digits, and 6*4=24).

But then the documentation states:

The first encoding you might think of is using 32-bit integers as the code unit

Why is that? The first encoding I could think of would use 24-bit integers, not 32-bit ones.
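To spell out the arithmetic behind my reasoning (a quick check at the Python prompt):

```python
# The reasoning above: 0x10FFFF has 6 hex digits, each hex digit is 4 bits,
# so 24 bits would already be enough to hold any code point.
max_code_point = 0x10FFFF
hex_digits = len(f"{max_code_point:X}")   # 6
print(hex_digits * 4)                     # 24
print(max_code_point < 2 ** 24)           # True, so 24 bits suffice
```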

robertspierre

2 Answers


Actually, you only need 21 bits. Many CPUs use 32-bit registers natively, and most languages have a 32-bit integer type.

If you study the UTF-16 and UTF-8 encodings, you’ll find that their algorithms encode a maximum of a 21-bit code point using two 16-bit code units and four 8-bit code units, respectively.
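You can verify both claims from the Python prompt (a minimal check using only the built-in codecs):

```python
# 0x10FFFF, the highest code point, fits in 21 bits...
print((0x10FFFF).bit_length())            # 21

# ...and it needs two 16-bit code units in UTF-16 (a surrogate pair)
# and four 8-bit code units in UTF-8.
ch = "\U0010FFFF"
print(len(ch.encode("utf-16-le")) // 2)   # 2
print(len(ch.encode("utf-8")))            # 4
```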

Mark Tolonen

Because it is the standard way. Python uses a different internal encoding depending on the content of the string: ASCII/ISO-8859-1, UTF-16, or UTF-32. UTF-32 is a commonly used representation (usually internal to a program) for Unicode code points. So instead of reinventing yet another encoding (e.g. a "UTF-22"), Python just uses the UTF-32 representation. It also makes the different interfaces simpler. It is not as efficient in space, but much more efficient for string operations.
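You can observe that flexible internal representation (PEP 393 in CPython) indirectly through memory use; the exact byte counts are CPython implementation details, so treat this only as a sketch:

```python
import sys

# CPython stores a string with 1, 2 or 4 bytes per character, depending on
# the widest code point it contains (PEP 393); sizes are CPython-specific.
for s in ("abcd", "ab\u0100d", "ab\U00010000d"):
    widest = max(map(ord, s))
    print(f"widest code point U+{widest:06X}: {sys.getsizeof(s)} bytes")
```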

Note: in rare cases Python also uses the surrogate range to encode "wrong" bytes, so you need more than just the ordinary (non-surrogate) code points.
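The mechanism referred to here is the surrogateescape error handler, which smuggles undecodable bytes into lone surrogates so they can be restored later; a minimal sketch:

```python
# Invalid UTF-8 bytes are mapped to lone surrogates (U+DC80..U+DCFF)
# and round-trip back to the original bytes when re-encoded.
data = b"valid\xff\xfe"
text = data.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in text[-2:]])                         # ['0xdcff', '0xdcfe']
print(text.encode("utf-8", errors="surrogateescape") == data)   # True
```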

Note: colour encoding has a similar story: 8 bits * 3 channels = 24 bits, but colours are often stored in 32-bit integers (also for other reasons: a single write instead of 2 reads + 2 writes on the bus). 32 bits are simply easier and faster to handle.
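For comparison, the colour analogy in code: packing three 8-bit channels into one 32-bit word (a hypothetical 0x00RRGGBB layout, purely for illustration), with the top byte left unused much like the spare bits in UTF-32:

```python
def pack_rgb(r: int, g: int, b: int) -> int:
    # Three 8-bit channels occupy 24 bits; the top byte of the
    # 32-bit word is unused padding, traded for simpler access.
    return (r << 16) | (g << 8) | b

print(hex(pack_rgb(0x12, 0x34, 0x56)))    # 0x123456
```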

Giacomo Catenazzi