
The current version of UTF-16 is only capable of encoding 1,112,064 different numbers (code points): the range 0x0-0x10FFFF, minus the 2,048 surrogate values.
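For reference, here is where that figure comes from (a quick sanity check: the range 0xD800-0xDFFF is reserved for UTF-16 surrogates and can never be assigned to a character):

```c
#include <stdio.h>

int main(void)
{
    long total      = 0x10FFFF + 1L;         /* 1,114,112 values in 0x0-0x10FFFF    */
    long surrogates = 0xDFFF - 0xD800 + 1L;  /* 2,048 reserved as UTF-16 surrogates */
    printf("%ld\n", total - surrogates);     /* prints 1112064                      */
    return 0;
}
```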

Does the Unicode Consortium intend to make UTF-16 run out of characters?

i.e., assign a code point > 0x10FFFF?

If not, why would anyone write a UTF-8 parser that accepts 5- or 6-byte sequences? It would only add unnecessary instructions to the function.

Isn't 1,112,064 enough? Do we actually need MORE characters? I mean: how quickly are we running out?

GlassGhost
  • I happen to know a `utf8-loose` parser that accepts 13-byte code points. This is not unuseful. Obviously this process doesn’t give a fart about UTF-16, which is a very unfortunate legacy we’d all like to forget since it incorporates the worst disadvantages of both UTF-8 and UTF-32 without enjoying any of the advantages of either: UTF-16 is truly the worst of both worlds. But make no mistake: any strict UTF-8 parser ***must*** reject code points over 4 bytes in encoded length. This is to kiss UTF-16’s sweet you know what. – tchrist Feb 22 '12 at 00:31
  • Wake me up when they discover a new civilization with a non-alphabetic writing system. – Hans Passant Feb 22 '12 at 01:09
  • @HansPassant **Time to wake up** Alphabets are just one of the forms that human writing takes. There are also syllabaries and logograms. Bazillions of logograms. CJK Extension E is nearly ready, and that has 6,000 new characters in it — not one of which has anything to do with an “alphabet”. – tchrist Feb 22 '12 at 01:44
  • Actually it wouldn't even be impossible to extend UTF-16 in the same way that it itself was derived from UCS-2: by setting aside a range of code points outside the BMP as "extended surrogates", a sequence of which could then encode code points outside the current codespace. FWIW, even at the current rate the codespace might turn out to be enough for the next decades. – Philipp Feb 22 '12 at 23:32
  • @tchrist There's lots of unfortunate legacies we'd like to forget, but so long as UTF-16 is used in Windows and Java, it's a reality that many people have to acknowledge. Even if you're working in a pure UTF-8 environment, you're going to have to deal with programs that are built for real-world compatibility. Mangle encoding all you want inside your box, but outside that box, standard UTF-8 is the only UTF-8 anyone should see. – prosfilaes Aug 05 '12 at 08:32
  • @prosfilaes and @tchrist Logograms shouldn't be worthy of character status; I could understand adding new math characters or a new safety/currency symbol, but is ANOTHER version of a smiley face really worth adding to every single font library? As if you couldn't use the application-specific code points already? As for me, I'll just stick with "`;)`". Just make an SVG file and embed it with an `img` tag. – GlassGhost Aug 19 '12 at 07:51
  • @GlassGhost By logograms tchrist meant Chinese characters. I don't believe anyone supports all Unicode characters; if you're making a font, feel free to exclude whatever characters you want. By sheer count, the few hundred emoji that were new to Unicode aren't that major, especially when compared to the tens of thousands of Chinese characters being encoded. – prosfilaes Aug 19 '12 at 21:36
  • @prosfilaes I know you can exclude; the point is some characters shouldn't be added to the standard. Also, I assume that people mean what they say. – GlassGhost Aug 21 '12 at 06:12
  • @GlassGhost He did say what he meant; for example, the Encyclopedic Dictionary of Archaeology says "Writing systems that make use of logograms include Chinese, Egyptian hieroglyphic writing, and early cuneiform writing systems." – prosfilaes Aug 21 '12 at 12:02

4 Answers


As of 2011 we have consumed 109,449 characters and set aside another 137,468 code points for application (private) use (6,400 + 131,068), leaving room for over 860,000 unused code points (1,112,064 - 109,449 - 137,468 = 865,147). That is plenty for CJK Extension E (~10,000 chars) and 85 more sets just like it, so in the event of contact with the Ferengi culture, we should be ready.

In November 2003 the IETF restricted UTF-8 to end at U+10FFFF with RFC 3629, in order to match the constraints of the UTF-16 character encoding: a UTF-8 parser should not accept 5- or 6-byte sequences that would overflow the UTF-16 range, or 4-byte sequences that decode to a value greater than 0x10FFFF.
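As a rough illustration, here is a minimal sketch of what such a strict decoder can look like (the function name and interface are my own, not from any particular library). It rejects 5/6-byte lead bytes, overlong forms, UTF-16 surrogates, and anything above U+10FFFF:

```c
/* Decode one code point from [*s, end); return it and advance *s,
 * or return -1 on anything RFC 3629 forbids. */
long decode_utf8_strict(const unsigned char **s, const unsigned char *end)
{
    const unsigned char *p = *s;
    long cp, min;
    int trailing;
    unsigned char b;

    if (p >= end) return -1;
    b = *p++;

    if (b < 0x80)      { cp = b;        trailing = 0; min = 0x00;    }
    else if (b < 0xC2) { return -1; }   /* continuation byte or overlong 2-byte lead       */
    else if (b < 0xE0) { cp = b & 0x1F; trailing = 1; min = 0x80;    }
    else if (b < 0xF0) { cp = b & 0x0F; trailing = 2; min = 0x800;   }
    else if (b < 0xF5) { cp = b & 0x07; trailing = 3; min = 0x10000; }
    else               { return -1; }   /* 0xF5-0xFF: > U+10FFFF, incl. old 5/6-byte leads */

    while (trailing--) {
        if (p >= end || (*p & 0xC0) != 0x80) return -1;  /* truncated or bad trail byte */
        cp = (cp << 6) | (*p++ & 0x3F);
    }

    /* Reject overlong encodings, UTF-16 surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return -1;

    *s = p;
    return cp;
}
```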

Please add edits here listing any character sets that threaten the size of the Unicode code point limit, if they are over 1/3 the size of CJK Extension E (~10,000 chars):

GlassGhost

At the present time, the Unicode standard doesn't define any characters above U+10FFFF, so you would be fine coding your app to reject characters above that point.

Predicting the future is hard, but I think you're safe for the near term with this strategy. Honestly, even if Unicode extends past U+10FFFF in the distant future, it almost certainly won't be for mission-critical glyphs. Your app might not be compatible with the new Ferengi fonts that come out in 2063, but you can always fix it when it actually becomes an issue.
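If you do take that route, the check itself is tiny. A minimal sketch (the helper name is made up for illustration):

```c
#include <stdbool.h>

/* True if cp is a Unicode scalar value: within 0x0-0x10FFFF and not a UTF-16 surrogate. */
static bool is_valid_code_point(long cp)
{
    return cp >= 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```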

StilesCrisis
  • I dunno, the Star Trek buffs might get mad? But shouldn't we still have room even with that? I think 1,112,064 is a LOT of damn characters; I'm used to English, and with ASCII and all the math symbols and Greek symbols I can think of we only have like 512. – GlassGhost Feb 21 '12 at 20:23
  • Sure, but basic Japanese at a high school level has several thousand. Chinese, more still. Some languages just have more glyphs than others. Still, I agree that one million glyphs ought to stretch a long way. – StilesCrisis Feb 21 '12 at 23:44
  • I also agree that one million glyphs ought to stretch a long way. – GlassGhost Feb 24 '12 at 15:48
  • @GlassGhost: Sure, and 640 kilobytes of memory is enough for anyone. – Keith Thompson Feb 24 '12 at 18:56
  • To be fair, human languages aren't affected by Moore's law--and thank goodness for that!! – StilesCrisis Feb 24 '12 at 19:49
  • @KeithThompson 640 kilobytes IS enough memory for DirectX 11. Then we will run out of characters before we meet an alien species. – GlassGhost Mar 01 '12 at 19:58
  • @StilesCrisis and the moment interstellar flights and terraforming become normal is the moment languages start growing at a rate of `Θ(T^2)`; and with intergalactic flights: `Θ(T^3)`; and once arbitrary-distance wormholes can be created: `Θ(c^T)|c>1`, thus making Moore's law apply to human languages. – user3338098 Jul 07 '23 at 20:07

Cutting to the chase:

It is indeed intentional that the encoding system only supports code points up to U+10FFFF.

It does not appear that there is any real risk of running out any time soon.

Perry

There is no reason to write a UTF-8 parser that supports 5-6 byte sequences, except to support legacy systems that actually used them. The current official UTF-8 specification does not support 5-6 byte sequences, in order to accommodate 100% lossless conversions to/from UTF-16. If there is ever a time that Unicode has to support new codepoints above U+10FFFF, there will be plenty of time to devise new encoding formats for the higher bit counts. Or maybe by the time that happens, memory and computational power will be sufficient that everyone will just switch to UTF-32 for everything, which can handle up to U+FFFFFFFF for over 4 billion characters.
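For anyone who does have to read such legacy data, here is a rough sketch of the lead-byte rules from the original pre-RFC 3629 design (the helper name is made up); a strict parser today treats lengths 5 and 6 as errors:

```c
/* Sequence length implied by a lead byte under the original (RFC 2279-era) UTF-8
 * design, or 0 for a bare continuation byte / invalid lead byte. */
static int legacy_utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;   /* 0xxxxxxx: U+0000..U+007F                   */
    if (lead < 0xC0) return 0;   /* 10xxxxxx: continuation byte, not a lead    */
    if (lead < 0xE0) return 2;   /* 110xxxxx: up to U+07FF                     */
    if (lead < 0xF0) return 3;   /* 1110xxxx: up to U+FFFF                     */
    if (lead < 0xF8) return 4;   /* 11110xxx: up to U+1FFFFF                   */
    if (lead < 0xFC) return 5;   /* 111110xx: up to U+3FFFFFF  (now forbidden) */
    if (lead < 0xFE) return 6;   /* 1111110x: up to U+7FFFFFFF (now forbidden) */
    return 0;                    /* 0xFE/0xFF: never valid in any version      */
}
```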

Remy Lebeau
  • That’s not exactly true. There are systems that use a modified version of the UTF-8 algorithm to allow for non-Unicode code points up to 2⁷²−1. So long as cooperating processes do not pretend these so-called ‘hypers’ are actual Unicode code points, or that the encoding is identical to UTF-8 (although it largely is), there’s nothing in the Standard that forbids them. And if you can’t think of anything creative, interesting, and useful to do with an extra 51 bits of namespace for characters, I certainly know people who can. And no, these people don’t give a rat’s sassy mamma about UTF-16. Who would? – tchrist Feb 22 '12 at 00:25
  • If a system is using a UTF-8-like encoding for non-Unicode values, then it is not really UTF-8, it is just a custom encoding that was inspired by UTF-8. The OP's question was specifically about standard UTF-8 and Unicode, and in that case what I wrote in my answer applies. – Remy Lebeau Mar 31 '15 at 02:34