Can anyone show me the calculation for how UTF-8 represents 1112064 characters?

Question

I don't understand how UTF-8 represents 1112064 characters.

My calculation is something like this: 2⁷ + 2¹¹ + 2¹⁶ + 2²¹ = 2164864 characters.

In UTF-8, a 1-byte sequence carries 7 payload bits, a 2-byte sequence 11 bits, a 3-byte sequence 16 bits, and a 4-byte sequence 21 bits.

Does the number 1112064 exclude emojis?

Where did you get this number from? [Wikipedia says](https://en.wikipedia.org/wiki/UTF-8) `UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode` which doesn't imply that UTF-8 can encode _exactly_ 1112064 values. — tkausl, Jul 02 '22 at 03:32
This is just a terminology issue. You're right that UTF-8 can technically encode 2164864 different values, but since UTF-8 was designed to encode _Unicode_ code points, only the 1112064 valid code points may be encoded; all other values are invalid in UTF-8. — tkausl, Jul 02 '22 at 03:55
Related: https://stackoverflow.com/questions/130438/do-utf-8-utf-16-and-utf-32-differ-in-the-number-of-characters-they-can-store — dan04, Jul 07 '22 at 21:01
@HasnatPie Your calculation is flawed. The 2^7 range covers only codepoints U+0000..U+007F, the 2^11 range covers only codepoints U+0080..U+07FF, the 2^16 range covers only codepoints U+0800..U+FFFF, and the 2^21 range covers only codepoints U+10000..U+10FFFF. That is 1114112 codepoints, minus 2048 codepoints that Unicode reserves that can't be used, for a grand total of 1112064 codepoints. Your calculation is re-counting the same values over and over in higher ranges, which is why the result is larger than you are expecting. — Remy Lebeau, Jul 07 '22 at 21:07

score 4 · Accepted Answer · edited Jul 07 '22 at 20:59

1112064 is the number of valid Unicode code points. It consists of 17 regions of 65536 code points, U+NN0000..U+NNFFFF where NN is 0x00 (the BMP, or Basic Multilingual Plane) through 0x10, less the reserved 2048 code points used for surrogates in the UTF-16 encoding, U+D800..U+DFFF.

17 x 65536 - 2048 = 1112064
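
As a sanity check on that arithmetic, here is a minimal Python sketch that simply tries to encode every value from U+0000 to U+10FFFF and groups the successes by encoded length. The function name `count_utf8_encodable` is made up for illustration, and the sketch assumes Python 3's built-in `utf-8` codec, which refuses to encode the surrogates.

```python
from collections import Counter


def count_utf8_encodable():
    """Count code points the UTF-8 codec accepts, grouped by encoded byte length."""
    by_length = Counter()
    for cp in range(0x110000):  # U+0000..U+10FFFF: 17 planes of 65536
        try:
            encoded = chr(cp).encode("utf-8")
        except UnicodeEncodeError:
            continue  # surrogates U+D800..U+DFFF are rejected by the codec
        by_length[len(encoded)] += 1
    return by_length


counts = count_utf8_encodable()
print(dict(counts))          # {1: 128, 2: 1920, 3: 61440, 4: 1048576}
print(sum(counts.values()))  # 1112064
print(17 * 65536 - 2048)     # 1112064
```

The per-length counts show why adding 2⁷ + 2¹¹ + 2¹⁶ + 2²¹ overcounts: each longer form is only valid for code points that do not fit in a shorter form, and the 3-byte range additionally loses the 2048 surrogates.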

The UTF-8 bit layout can represent more values than that, but the specification restricts valid UTF-8 to valid Unicode code points, each encoded in its shortest representation. For example, U+0000 can be written as the 1-byte 0x00 and also as the 2-byte 0xC0 0x80, but the latter is invalid, as are the 3-byte and longer versions.
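
The shortest-form rule can be observed with a strict decoder. The sketch below assumes Python 3's built-in `utf-8` codec with default (strict) error handling; the helper name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\x00"))              # True:  U+0000, shortest form
print(is_valid_utf8(b"\xc0\x80"))          # False: overlong 2-byte U+0000
print(is_valid_utf8(b"\xe0\x80\x80"))      # False: overlong 3-byte U+0000
print(is_valid_utf8(b"\xed\xa0\x80"))      # False: surrogate U+D800
print(is_valid_utf8(b"\xf4\x8f\xbf\xbf"))  # True:  U+10FFFF, the last code point
print(is_valid_utf8(b"\xf4\x90\x80\x80"))  # False: would be U+110000, out of range
```

Every rejected sequence fits the UTF-8 bit pattern, which is where the larger 2164864 figure comes from; they are excluded by the validity rules, not by the bit layout.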

edited Jul 07 '22 at 20:59

Remy Lebeau

answered Jul 02 '22 at 07:06

Mark Tolonen

Do the reserved 2048 code points exist in the basic multilingual plane? That is U+00D800..U+00DFFF are the reserved code points. So the BMP itself has 65536-2048 = 63488 valid code points. Also when you use 4 hexadecimal values, is it assumed you are using a code point in the BMP and not in one of the supplementary planes? – Nicholas Cousar Apr 26 '23 at 00:26
@NicholasCousar Yes, and no matter how you write it, code points < U+10000 are the BMP. U+00xxxx or U+xxxx are the same. – Mark Tolonen Apr 26 '23 at 00:34

score 0 · Answer 2 · answered Aug 27 '22 at 05:25

4 ^ 8   +   4 ^    10                                 =  1,114,112

4 ^ 8   +   4 ^    10    -       (  4 ^ 4     ) * 8   =  1,112,064

4 ^ 8   +   4 ^ (5 * 2)  -       (      4 ^ 5 ) * 2   =  1,112,064

4 ^ 8   +   4 ^ (5 * 2)  -   2 ^ (  5 + 2 + 4 )       =  1,112,064

4 ^ 8   +   4 ^ (5 * 2)  -   2 ^ ( -5 + 2 ^ 4 )       =  1,112,064

——————————————————————————————————————————————————

B.M.P.       supp-          surrogates
             planes

fun side notes :

4 ^ 2 ^ 5 = 16 ^ 16 = 2 ^ 64
2 ^ 4 ^ 5 = 2 ^ 1024
