1

I don't understand how UTF-8 represents 1112064 characters.

My calculation is something like this: 2^7 + 2^11 + 2^16 + 2^21 = 2164864 characters.

A 1-byte UTF-8 sequence has 7 bits available for the character, a 2-byte sequence has 11 bits, a 3-byte sequence has 16 bits, and a 4-byte sequence has 21 bits.

Does the number 1112064 exclude emoji?

Remy Lebeau
Hasnat Pie
  • Where did you get this number from? [Wikipedia says](https://en.wikipedia.org/wiki/UTF-8) `UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode`, which doesn't imply that UTF-8 can encode _exactly_ 1112064 values. – tkausl Jul 02 '22 at 03:32
  • from [this](https://youtu.be/Mcuqzx3rBWc?t=292) tutorial – Hasnat Pie Jul 02 '22 at 03:47
  • This is just a terminology issue. You're right, UTF-8 can technically encode 2164864 different values, but since UTF-8 was designed to encode _Unicode_ code points, it can only encode (all) 1112064 code points; all other values are invalid in UTF-8. – tkausl Jul 02 '22 at 03:55
  • Related: https://stackoverflow.com/questions/130438/do-utf-8-utf-16-and-utf-32-differ-in-the-number-of-characters-they-can-store – dan04 Jul 07 '22 at 21:01
  • @HasnatPie Your calculation is flawed. The 2^7 range covers only codepoints U+0000..U+007F, the 2^11 range covers only codepoints U+0080..U+07FF, the 2^16 range covers only codepoints U+0800..U+FFFF, and the 2^21 range covers only codepoints U+10000..U+10FFFF. That is 1114112 codepoints, minus 2048 codepoints that Unicode reserves that can't be used, for a grand total of 1112064 codepoints. Your calculation is re-counting the same values over and over in higher ranges, which is why the result is larger than you are expecting. – Remy Lebeau Jul 07 '22 at 21:07

2 Answers

4

1112064 is the number of valid Unicode code points. It consists of 17 regions of 65536 code points, U+NN0000..U+NNFFFF where NN is 0x00 (the BMP, or Basic Multilingual Plane) through 0x10, less the reserved 2048 code points used for surrogates in the UTF-16 encoding, U+D800..U+DFFF.

17 x 65536 - 2048 = 1112064
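As a quick check of this arithmetic, here is a minimal Python sketch. The constant and function names are my own, chosen for illustration. It computes the total by plane, then counts the same code points grouped by how many bytes UTF-8 needs for them, so each code point is counted once rather than once per bit width.

```python
# Illustrative names only; nothing here comes from a library.
PLANES = 17
PLANE_SIZE = 0x10000                  # 65536 code points per plane
SURROGATES = range(0xD800, 0xE000)    # U+D800..U+DFFF, 2048 code points

total = PLANES * PLANE_SIZE - len(SURROGATES)

# First and last code point covered by each UTF-8 sequence length.
LENGTH_RANGES = {
    1: (0x0000, 0x007F),
    2: (0x0080, 0x07FF),
    3: (0x0800, 0xFFFF),
    4: (0x10000, 0x10FFFF),
}

def count_by_length():
    """Count valid code points per encoded length, skipping surrogates."""
    counts = {}
    for length, (first, last) in LENGTH_RANGES.items():
        size = last - first + 1
        # The surrogate block lies entirely inside the 3-byte range.
        if first <= SURROGATES.start and SURROGATES.stop - 1 <= last:
            size -= len(SURROGATES)
        counts[length] = size
    return counts

counts = count_by_length()
print(total)
print(counts)
print(sum(counts.values()) == total)
```

The per-length counts are much smaller than 2^7, 2^11, 2^16 and 2^21 because each longer form is only valid for code points that do not fit in a shorter one.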

UTF-8 can represent more than that, but the specification restricts valid UTF-8 to valid Unicode code points encoded in their shortest representation. For example, U+0000 can be encoded as the 1-byte sequence 0x00 and also as the 2-byte sequence 0xC0 0x80, but the latter is invalid, as are the 3-byte and longer versions.
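One way to see these restrictions in practice is to hand such byte sequences to a strict decoder. The sketch below uses Python 3's built-in `bytes.decode("utf-8")`; the helper name `is_valid_utf8` is hypothetical, and other decoders may be more lenient than Python's.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

samples = {
    "U+0000, 1 byte": b"\x00",
    "U+0000, overlong 2 bytes": b"\xC0\x80",
    "U+0000, overlong 3 bytes": b"\xE0\x80\x80",
    "U+D800, surrogate": b"\xED\xA0\x80",
    "U+10FFFF, last code point": b"\xF4\x8F\xBF\xBF",
    "U+110000, out of range": b"\xF4\x90\x80\x80",
}

for label, data in samples.items():
    print(label, is_valid_utf8(data))
```

Only the shortest form of U+0000 and the encoding of U+10FFFF are accepted; the overlong, surrogate, and out-of-range sequences are rejected even though their bit patterns fit the UTF-8 layout.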

Remy Lebeau
Mark Tolonen
  • Do the reserved 2048 code points exist in the basic multilingual plane? That is U+00D800..U+00DFFF are the reserved code points. So the BMP itself has 65536-2048 = 63488 valid code points. Also when you use 4 hexadecimal values, is it assumed you are using a code point in the BMP and not in one of the supplementary planes? – Nicholas Cousar Apr 26 '23 at 00:26
  • @NicholasCousar Yes, and no matter how you write it, code points < U+10000 are the BMP. U+00xxxx or U+xxxx are the same. – Mark Tolonen Apr 26 '23 at 00:34
0
4 ^ 8   +   4 ^    10                                 =  1,114,112

4 ^ 8   +   4 ^    10    -       (  4 ^ 4     ) * 8   =  1,112,064

4 ^ 8   +   4 ^ (5 * 2)  -       (      4 ^ 5 ) * 2   =  1,112,064

4 ^ 8   +   4 ^ (5 * 2)  -   2 ^ (  5 + 2 + 4 )       =  1,112,064

4 ^ 8   +   4 ^ (5 * 2)  -   2 ^ ( -5 + 2 ^ 4 )       =  1,112,064

——————————————————————————————————————————————————

Terms, left to right: B.M.P. (4 ^ 8), supplementary planes (4 ^ 10), surrogates (the subtracted term).

fun side notes :

  • 4 ^ 2 ^ 5 = 16 ^ 16 = 2 ^ 64
  • 2 ^ 4 ^ 5 = 2 ^ 1024
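These identities can be checked with a few lines of Python, offered here as a rough sketch with variable names of my own. It assumes `^` in this answer means exponentiation and that stacked exponents group from the right, which matches how Python's `**` operator behaves.

```python
bmp = 4 ** 8              # 65536 code points in the BMP
supplementary = 4 ** 10   # 16 supplementary planes of 65536 each

# The four ways the answer writes the surrogate count.
surrogate_forms = [
    (4 ** 4) * 8,
    (4 ** 5) * 2,
    2 ** (5 + 2 + 4),
    2 ** (-5 + 2 ** 4),
]

all_code_points = bmp + supplementary
valid_code_points = all_code_points - surrogate_forms[0]

print(all_code_points)
print(valid_code_points)
print(all(form == 2048 for form in surrogate_forms))

# Side notes: ** is right-associative, so 4 ** 2 ** 5 means 4 ** (2 ** 5).
print(4 ** 2 ** 5 == 16 ** 16 == 2 ** 64)
print(2 ** 4 ** 5 == 2 ** 1024)
```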
RARE Kpop Manifesto