
I was reading the Python guide about Unicode. In this section, it says:

To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff. This sequence needs to be represented as a set of bytes (meaning, values from 0-255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.

The first encoding you might think of is an array of 32-bit integers. In this representation, the string “Python” would look like this:

   P           y           t           h           o           n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
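
As a quick check (this snippet is mine, not part of the guide), the same layout comes out of Python's built-in utf-32-le codec, i.e. little-endian UTF-32 without a byte order mark:

    >>> data = "Python".encode("utf-32-le")
    >>> list(data)                # one 4-byte (32-bit) unit per code point
    [80, 0, 0, 0, 121, 0, 0, 0, 116, 0, 0, 0, 104, 0, 0, 0, 111, 0, 0, 0, 110, 0, 0, 0]
    >>> data.decode("utf-32-le")  # decoding applies the rules in reverse
    'Python'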

Why might we think of 32-bit integers if code points are numbers from 0 to 0x10ffff? Is it maybe assuming that we are on a 32-bit system?
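
Here is the arithmetic behind my doubt (again my own check, not from the guide):

    >>> (0x10FFFF).bit_length()  # bits needed for the largest code point
    21
    >>> 0x10FFFF > 0xFFFF        # too large for 16 bits
    True

So 21 bits per code point would suffice in principle, yet the example spends 32.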

floatingpurr
  • 16 bits are not enough. The next number is 32, isn't it? – Hermann Döppes Dec 21 '15 at 11:50
  • Thanks! My problem was this: since code points run from 0 to 1,114,111 (0x10ffff), we would need only 21 bits to encode them all. But that 32-bit format is probably due to computer architectures. Is that right? – floatingpurr Dec 21 '15 at 12:02
  • I'd assume this, plus the fact that programmers tend toward powers of two (for the same reason). So the “first thing a programmer thinks of” would be a power of two. But there might be deeper reasons; the only way to be sure is to ask the authors, I guess. – Hermann Döppes Dec 21 '15 at 12:08
  • Possible duplicate of [Why UTF-32 exists whereas only 21 bits are necessary to encode every character?](http://stackoverflow.com/questions/6339756/why-utf-32-exists-whereas-only-21-bits-are-necessary-to-encode-every-character) – dan04 Dec 22 '15 at 23:59
  • Yep, I'm sorry. Do I have to delete this question? – floatingpurr Dec 24 '15 at 00:45

0 Answers