
I have read many articles trying to find out the maximum number of Unicode code points, but I did not find a definitive answer.

I understand that the Unicode code-point range was limited so that the UTF-8, UTF-16, and UTF-32 encodings can all handle the same set of code points. But what is that number of code points?

The most frequent answer I encountered is that Unicode code points are in the range 0x000000 to 0x10FFFF (1,114,112 code points), but I have also read in other places that there are 1,112,114 code points. So is there one number to be given, or is the issue more complicated than that?

  • possible duplicate of [How many characters can be mapped with Unicode?](http://stackoverflow.com/questions/5924105/how-many-characters-can-be-mapped-with-unicode) – Jonathan Leffler Dec 11 '14 at 06:20

3 Answers


The maximum valid code point in Unicode is U+10FFFF, which makes it a 21-bit code set (but not all 21-bit integers are valid Unicode code points; specifically the values from 0x110000 to 0x1FFFFF are not valid Unicode code points).

This is where the number 1,114,112 comes from: U+0000 .. U+10FFFF is 1,114,112 values.

However, there is also a set of code points that are the surrogates for UTF-16. These are in the range U+D800 .. U+DFFF. These 2,048 code points are reserved for use by UTF-16.

1,114,112 - 2,048 = 1,112,064

There are also 66 noncharacters. These are discussed in part in Corrigendum #9: 34 values of the form U+nFFFE and U+nFFFF (where n is a value 0x00000, 0x10000, …, 0xF0000, 0x100000), and 32 values U+FDD0 .. U+FDEF. Subtracting those too yields 1,111,998 allocatable characters.

There are three ranges reserved for 'private use': U+E000 .. U+F8FF, U+F0000 .. U+FFFFD, and U+100000 .. U+10FFFD.

The number of values actually assigned depends on the version of Unicode you're looking at. You can find information about the latest version at the Unicode Consortium. Amongst other things, the Introduction there says:

The Unicode Standard, Version 7.0, contains 112,956 characters

So only about 10% of the available code points have been allocated.
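The arithmetic above can be checked with a short sketch; the constants come straight from the ranges quoted in this answer:

```javascript
// Recomputing the counts from the ranges given above.
const totalCodePoints = 0x10FFFF + 1;               // 1,114,112 (U+0000 .. U+10FFFF)
const surrogates = 0xDFFF - 0xD800 + 1;             // 2,048 reserved for UTF-16
const scalarValues = totalCodePoints - surrogates;  // 1,112,064
const noncharacters = 34 + 32;                      // U+nFFFE/U+nFFFF pairs plus U+FDD0 .. U+FDEF
const allocatable = scalarValues - noncharacters;   // 1,111,998

console.log(totalCodePoints, scalarValues, allocatable);
```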

I can't account for why you found 1,112,114 as the number of code points.

Incidentally, the upper limit U+10FFFF is chosen so that all the values in Unicode can be represented in one or two 2-byte coding units in UTF-16, using one high surrogate and one low surrogate to represent values outside the BMP or Basic Multilingual Plane, which is the range U+0000 .. U+FFFF.
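As a sketch of how that works: a code point above U+FFFF is offset by 0x10000, and the 20 bits that remain are split between a high and a low surrogate (`toSurrogatePair` is an illustrative helper, not a standard API):

```javascript
// Encode a code point above the BMP as a UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;            // at most 20 bits remain (cp <= 0x10FFFF)
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

toSurrogatePair(0x1F600); // [0xD83D, 0xDE00]
```

Plugging in the maximum offset 0xFFFFF gives high = 0xDBFF and low = 0xDFFF, which is exactly why U+10FFFF is the ceiling.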

Jonathan Leffler
  • 1,112,114 looks like a typo; it's the 000000..10FFFF count with the 2 and 4 transposed. – user313114 Dec 11 '14 at 05:47
  • @user313114: I suppose it could be a compound typo as you suggest; that is at least better than no explanation (which is roughly what I've got at the moment). I could make efforts to account for (slightly) smaller numbers of characters than the 1,112,064, but not for slightly larger. – Jonathan Leffler Dec 11 '14 at 05:49
  • See Philipp's answer: http://stackoverflow.com/questions/5924105/how-many-characters-can-be-mapped-with-unicode –  Dec 11 '14 at 05:51
  • @JonathanLeffler I believe there may be an ancient typo in this answer. In "…specifically the values from 0x11000 to 0x1FFFF are not valid…", I believe the range should be from 0x110000 to 0x1FFFFF (i.e both numbers are missing a digit). This makes more sense with your explanation, since the latter range represents 21-bit integers that aren't valid code points, while the one in your answer includes [valid code points](https://unicode.org/charts/PDF/U11000.pdf). – ravron Dec 19 '18 at 15:50
  • @ravron: You are right; thank you. I've fixed it, I think/hope. – Jonathan Leffler Dec 19 '18 at 16:03

Yes, all the code points that can't be represented in UTF-16 (including using surrogates) have been declared invalid.

U+10FFFF is the highest code point, but the surrogates and noncharacters such as U+FFFE and U+FFFF aren't usable code points, so the total count is a bit lower.
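That ceiling is easy to observe in JavaScript, where String.fromCodePoint accepts U+10FFFF but throws a RangeError for anything above it:

```javascript
// U+10FFFF is the last valid code point; one past it throws.
console.log(String.fromCodePoint(0x10FFFF).length); // 2 (a surrogate pair in UTF-16)

let rejected = false;
try {
  String.fromCodePoint(0x110000);
} catch (e) {
  rejected = e instanceof RangeError;
}
console.log(rejected); // true
```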

user313114

I have made a small routine that prints a very long table on screen, from `start` to `start + range`, where `start` is a number the user can set. This is the snippet:

function getVal()
{
  var start = parseInt(document.getElementById('start').value);
  var range = parseInt(document.getElementById('range').value);
  var end = start + range;
  return [start, range, end];
}

function next()
{
  var values = getVal();
  document.getElementById('start').value = values[2];
  document.getElementById('ok').click();
}

function prev()
{
  var values = getVal();
  document.getElementById('start').value = values[0] - values[1];
  document.getElementById('ok').click();
}

function renderCharCodeTable()
{
  var values = getVal();
  var start = values[0];
  var end = values[2];

  const MINSTART = 0;          // Lowest allowed code
  const MAXEND = 4294967294;   // Highest allowed code (2^32 - 2)

  start = start < MINSTART ? MINSTART : start;
  end = end < MINSTART ? (MINSTART + 1) : end;

  start = start > MAXEND ? (MAXEND - 1) : start;
  end = end >= MAXEND ? (MAXEND + 1) : end;

  var tr = [];
  var unicodeCharSet = document.getElementById('unicodeCharSet');
  var cCode;
  var cPoint;

  for (var c = start; c < end; c++)
  {
    // String.fromCharCode never throws: it truncates its argument to 16 bits.
    cCode = String.fromCharCode(c);

    // String.fromCodePoint throws a RangeError for codes above 0x10FFFF.
    try
    {
      cPoint = String.fromCodePoint(c);
    }
    catch (e)
    {
      cPoint = 'fromCodePoint max val exceeded';
    }

    tr[c] = '<tr><td>' + c + '</td><td>' + cCode + '</td><td>' + cPoint + '</td></tr>';
  }
  unicodeCharSet.innerHTML = tr.join('');
}

function startRender()
{
  setTimeout(renderCharCodeTable, 100);
  console.time('renderCharCodeTable');
}

// Attach to the window load event (a "load" listener on the tbody never fires,
// and passing startRender() would invoke it immediately instead of on load).
window.addEventListener("load", startRender);
body
  {
   margin-bottom: 50%;
  }
  
  form
  {
   position: fixed;
  }
  
  table *
  {
   border: 1px solid black;
   font-size: 1em;
   text-align: center;
  }
  
  table
  {
   margin: auto;
   border-collapse: collapse;
  }
  
  td:hover
  {
   padding-bottom: 1.5em;
   padding-top: 1.5em;
  }
  
  tbody > tr:hover
  {
   font-size: 5em;
  }
 
 <form>
  Start Unicode: <input type="number" id="start" value="0" onchange="renderCharCodeTable()" min="0" max="4294967294" title="Set a number from 0 to 4294967294" >
  <p></p>
  Show <input type="number" id="range" value="30" onchange="renderCharCodeTable()" min="1" max="1000" title="Range to show. Insert a value from 1 to 1000" > symbols at once.
  <p></p>
  <input type="button" id="pr" value="◄◄" onclick="prev()" title="Show previous" >
  <input type="button" id="nx" value="►►" onclick="next()" title="Show next" >
  <input type="button" id="ok" value="OK" onclick="startRender()" title="Ok" >
  <input type="reset" id="rst" value="X" onclick="startRender()" title="Reset" >
  
 </form>
 <table>
  <thead>
   <tr>
    <th>CODE</th>
    <th>Symbol fromCharCode</th>
    <th>Symbol fromCodePoint</th>
   </tr>
  </thead>
  <tbody id="unicodeCharSet">
   <tr><td colspan="3">Rendering...</td></tr>
  </tbody>
 </table>

Run it once, then open the code and set the start value to a number just a little lower than the MAXEND constant. The following is what I obtained:

    code        equivalent symbol
{~~~ first execution output example ~~~~~}

0   
1   
2   
3   
4   
5   
6   
7   
8   
9   
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  !
34  "
35  #
36  $
37  %
38  &
39  '
40  (
41  )
42  *
43  +
44  ,
45  -
46  .
47  /
48  0
49  1
50  2
51  3
52  4
53  5
54  6
55  7
56  8
57  9
{~~~ second execution output example ~~~~~}
4294967275  →
4294967276  ↓
4294967277  ■
4294967278  ○
4294967279  ￯
4294967280  ￰
4294967281  ￱
4294967282  ￲
4294967283  ￳
4294967284  ￴
4294967285  ￵
4294967286  ￶
4294967287  ￷
4294967288  ￸
4294967289  
4294967290  
4294967291  
4294967292  
4294967293  �
4294967294  

The output of course is truncated (between the first and the second execution) because it is too long.

After 4294967294 (= 2^32 − 2) the function inexorably stops, so I supposed it had reached its maximum possible value, and I interpreted this as the maximum value of the Unicode char code table. Of course, as said in other answers, not all char codes have an equivalent symbol; frequently they are empty, as the example showed. Also, there are many symbols that are repeated multiple times at different points between char codes 0 and 4294967294.

Edit: improvements

(thanks @duskwuff)

Now it is also possible to compare the behavior of both String.fromCharCode and String.fromCodePoint. Notice that the first accepts values up to 4294967294, but its output repeats every 65536 codes (16 bits = 2^16). The second stops working at code 1114111 (since the list of Unicode chars and symbols starts from 0, that makes a total of 1,114,112 Unicode code points, though as said in other answers not all of them are valid, in the sense that some are empty points). Also remember that to use a certain Unicode char you need an appropriate font that has the corresponding glyph defined in it. If not, you will see an empty Unicode char or an empty square char.
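The repetition every 65536 codes comes from String.fromCharCode truncating its argument to 16 bits (the ToUint16 operation), so arguments that differ by a multiple of 0x10000 produce the same character. A quick check:

```javascript
// String.fromCharCode wraps modulo 0x10000, so 0x10041 renders as "A" (0x41).
console.log(String.fromCharCode(0x41));           // "A"
console.log(String.fromCharCode(0x41 + 0x10000)); // also "A"
console.log(String.fromCharCode(0x41) === String.fromCharCode(0x41 + 0x10000)); // true
```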

(Screenshot: table comparing String.fromCharCode and String.fromCodePoint output.)

Notice:

I have noticed that on some Android systems, Chrome for Android's String.fromCodePoint throws an error for all code points.

willy wonka
  • Unfortunately, the results of your function are incorrect. `String.fromCharCode` truncates its input to 16 bits; the upper 16 bits of the input you're passing are ignored. –  Oct 17 '16 at 17:22
  • Oh, thanks for the notice. I didn't know. How do I correct it? – willy wonka Oct 17 '16 at 18:39
  • Use `String.fromCodePoint` instead. And don't try to go above code point 0x10FFFF. –  Oct 17 '16 at 20:14
  • Tried locally: the routine generates an error after 1114111 and stops working. May I ask why? The error is: "Uncaught RangeError: Invalid code point [the processed number]" – willy wonka Oct 17 '16 at 22:12
  • 1114111 is 0x10FFFF. Unicode defines that as the maximum code point, so attempting to use a higher code point will generate an error. –  Oct 17 '16 at 23:06
  • @duskwuff Ok, I've made some improvements: now it is possible to compare both String.fromCharCode and String.fromCodePoint. – willy wonka Oct 18 '16 at 00:01
  • Can I make more improvements to my answer? – willy wonka Jan 06 '17 at 18:17
  • You can try, but I wouldn't count on it helping. An answer based on references to standards (like the accepted answer by Jonathan Leffler above) is far better than one based on observations and guesswork. –  Jan 06 '17 at 18:47
  • @duskwuff Agreed; nevertheless, each answer clarifies different aspects of the same problem, so I see them as complementary, not as rivals. Anyway, how can I improve my answer? – willy wonka Jan 12 '17 at 21:21
  • Good *alternative* answer. You don't deserve the downvotes. You put a lot of time into this answer. I wish more people would see that. – Jack G Apr 19 '20 at 22:24