I'm reading the popular Unicode article from Joel Spolsky and there's one illustration that I don't understand.

  1. What does "Hex Min, Hex Max" mean? What do those values represent? Min and max of what?

  2. Binary can only have 1 or 0. Why do I see so many letter "v"s here?


http://www.joelonsoftware.com/articles/Unicode.html


Peter Mortensen
Question Everything

2 Answers


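The Hex Min/Max define the range of Unicode characters (typically represented by their Unicode number, the code point, in hex).

The v refers to the bits of the original number: each v is a placeholder for one bit of the code point, while the literal 0s and 1s are fixed marker bits.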

So the first line is saying:

The Unicode characters in the range 0 (hex 00) to 127 (hex 7F) (a 7-bit number) are represented by a 1-byte bit string starting with '0' followed by all 7 bits of the Unicode number.

The second line is saying:

The Unicode numbers in the range 128 (hex 0080) to 2047 (hex 07FF) (an 11-bit number) are represented by a 2-byte bit string where the first byte starts with '110' followed by the first (most significant) 5 of the 11 bits, and the second byte starts with '10' followed by the remaining 6 of the 11 bits.

etc
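To make the pattern concrete, here is a minimal Python sketch of the bit layout described above, limited to the 1- to 4-byte forms that real UTF-8 uses. The names `encode_utf8` and `as_bits` are made up for this illustration; in practice you would just call `str.encode("utf-8")`.

```python
def encode_utf8(code_point: int) -> bytes:
    """Hand-rolled UTF-8 encoder, for illustration only."""
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:        # 7 bits:  0vvvvvvv
        out = [code_point]
    elif code_point < 0x800:     # 11 bits: 110vvvvv 10vvvvvv
        out = [0b11000000 | (code_point >> 6),
               0b10000000 | (code_point & 0b111111)]
    elif code_point < 0x10000:   # 16 bits: 1110vvvv 10vvvvvv 10vvvvvv
        out = [0b11100000 | (code_point >> 12),
               0b10000000 | ((code_point >> 6) & 0b111111),
               0b10000000 | (code_point & 0b111111)]
    elif code_point <= 0x10FFFF:  # 21 bits: 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
        out = [0b11110000 | (code_point >> 18),
               0b10000000 | ((code_point >> 12) & 0b111111),
               0b10000000 | ((code_point >> 6) & 0b111111),
               0b10000000 | (code_point & 0b111111)]
    else:
        raise ValueError("code point is beyond U+10FFFF")
    return bytes(out)


def as_bits(data: bytes) -> str:
    """Render each byte as 8 binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)


print(as_bits(encode_utf8(0x41)))    # 'A', 1 byte:  01000001
print(as_bits(encode_utf8(0xE9)))    # 'é', 2 bytes: 11000011 10101001
print(as_bits(encode_utf8(0x20AC)))  # '€', 3 bytes: 11100010 10000010 10101100
```

In the 'é' line you can see the '110' and '10' prefixes, with the 11 bits of hex E9 (00011101001) filling the v positions.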

Hope that makes sense

Sodved

Note that the table in Joel's article covers code points that do not, and never will, exist in Unicode. In fact, UTF-8 never needs more than 4 bytes, though the scheme underlying UTF-8 could be extended much further, as shown.

A more nuanced version of the table is available in "How does a file with Chinese characters know how many bytes to use per character?" It points out some of the gaps. For example, the bytes 0xC0, 0xC1, and 0xF5..0xFF can never appear in valid UTF-8. You can also see information about invalid UTF-8 at "Really good bad UTF-8 example test data".
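If you want to check the claim about forbidden bytes yourself, a quick sketch in Python is to lean on the built-in strict UTF-8 decoder. The helper name `is_valid_utf8` is just something made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Bytes that can never appear anywhere in valid UTF-8.
never_valid = [0xC0, 0xC1, *range(0xF5, 0x100)]

# Even when followed by a well-formed continuation byte (0x80),
# each of these lead bytes makes the sequence invalid.
rejected = [b for b in never_valid if not is_valid_utf8(bytes([b, 0x80]))]
print(len(rejected) == len(never_valid))  # True

# 0xC0 0xAF would be an "overlong" 2-byte spelling of '/' (U+002F),
# which must be encoded as the single byte 0x2F instead.
print(is_valid_utf8(b"\xc0\xaf"))  # False
print(is_valid_utf8(b"/"))         # True
```

0xC0 and 0xC1 are excluded because they could only start overlong encodings of values below 128, and 0xF5..0xFF because they would start sequences for values above U+10FFFF.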

In the table you showed, the Hex Min and Hex Max values are the minimum and maximum U+wxyz values (code points) that can be represented using the number of bytes in the 'byte sequence in binary' column. Note that the maximum code point in Unicode is U+10FFFF (and that is defined/reserved as a non-character). This is the maximum value that can be represented using the surrogate encoding scheme in UTF-16 using just 4 bytes (two UTF-16 code units).
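As a rough illustration of where the U+10FFFF ceiling comes from, the sketch below computes a UTF-16 surrogate pair by hand and compares it with Python's own encoder. The function name `surrogate_pair` is hypothetical, chosen only for this example.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x10FFFF)
print(hex(high), hex(low))  # 0xdbff 0xdfff

top = chr(0x10FFFF)
print(top.encode("utf-16-be").hex())  # dbffdfff (two 16-bit code units)
print(top.encode("utf-8").hex())      # f48fbfbf (four bytes)
```

Both surrogates carry 10 bits, so a pair covers 2^20 values on top of the first 0x10000, which ends exactly at U+10FFFF. Python reflects the same limit: `chr(0x110000)` raises `ValueError`.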

Jonathan Leffler