I don't understand what "noncharacter" code points are. They are supposedly forbidden Unicode characters, yet I can copy and paste them, like U+FFFF. If a character has a fixed position in Unicode, and can be used to display something, then:

  1. Why are those characters "noncharacter"?
  2. What is the point of classifying them as not a character, given that they hold a position in the table and can even be displayed (though as a replacement character) in HTML and CSS?
  3. What's the point in having so many empty spaces in Unicode, like in the "Specials" (FFF0-FFFF) block?
  • https://www.unicode.org/faq/private_use.html – dave_thompson_085 Mar 29 '21 at 21:20
  • 2
    [Private-Use Characters, Noncharacters & Sentinels FAQ](https://www.unicode.org/faq/private_use.html#:~:text=In%20Unicode%201.0%20the%20code,those%20early%20annotations%20and%20labels.) – JosefZ Mar 29 '21 at 21:50

1 Answer

The Specials block isn't empty. Several of the code points in that block are assigned. Most famously (and importantly), REPLACEMENT CHARACTER (U+FFFD) is in that block. And while it's not technically a character, the very important byte sequence FF FE (the little-endian UTF-16 form of the BOM, U+FEFF) can appear at the beginning of files, so it's useful that U+FFFE, which is what a byte-swapped BOM decodes to, not be an otherwise legal character. (The related U+FEFF is technically a character, but its use as a character is deprecated.) If new "specials" are needed, there are several slots still available for them, while staying within that block.
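To make the byte-order point concrete, here is a minimal Python sketch. The helper name `guess_utf16_byte_order` is hypothetical and only illustrates the idea: because U+FFFE is guaranteed never to be a real character, seeing it as the first 16-bit unit strongly suggests the data is little-endian and the unit is really a byte-swapped U+FEFF.

```python
BOM = "\ufeff"

# The same code point serialises to different byte sequences per byte order.
le_bytes = BOM.encode("utf-16-le")  # bytes FF FE
be_bytes = BOM.encode("utf-16-be")  # bytes FE FF


def guess_utf16_byte_order(data: bytes) -> str:
    """Hypothetical helper: read the first 16-bit unit as big-endian.

    0xFEFF means the data really is big-endian; 0xFFFE (a noncharacter)
    means the bytes are swapped, i.e. the data is little-endian.
    """
    first_unit = int.from_bytes(data[:2], "big")
    if first_unit == 0xFEFF:
        return "big-endian"
    if first_unit == 0xFFFE:
        return "little-endian"
    return "unknown"


print(guess_utf16_byte_order((BOM + "hi").encode("utf-16-le")))
print(guess_utf16_byte_order((BOM + "hi").encode("utf-16-be")))
```

Real decoders do more than this (UTF-8 and UTF-32 have their own signatures), so treat it as an illustration rather than a complete detector.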

Unicode prefers to group like things together into blocks whose sizes are convenient multiples of 16 (often powers of two), and so there wind up being some left-over values at the end of various blocks that aren't currently assigned. The total Unicode space is over a million code points (1,114,112, to be exact). Fewer than 300k have been allocated, so there's a lot of room to keep things tidy.
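If you want to check the rough proportions yourself, the sketch below counts code points using Python's `unicodedata` module. It rests on one assumption: "allocated" is taken to mean "general category is anything other than `Cn` (unassigned)", which also counts surrogates and private-use code points. The exact number depends on the Unicode version bundled with your Python, so no figure is given here.

```python
import sys
import unicodedata

# The whole code space runs from U+0000 to U+10FFFF.
TOTAL_CODE_POINTS = sys.maxunicode + 1


def count_allocated() -> int:
    """Count code points whose general category is not Cn (unassigned).

    Assumption: this is only a proxy for "allocated"; noncharacters and
    reserved code points both report Cn and so are excluded.
    """
    return sum(
        1
        for cp in range(TOTAL_CODE_POINTS)
        if unicodedata.category(chr(cp)) != "Cn"
    )


allocated = count_allocated()
print(f"Unicode data version: {unicodedata.unidata_version}")
print(f"{allocated} of {TOTAL_CODE_POINTS} code points are not Cn")
```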

The official noncharacters (the last two code points of each plane, U+xFFFE and U+xFFFF, plus U+FDD0-U+FDEF) leave room for "special uses" of values that you know will never be a character. Detecting a byte-swapped BOM is the most famous of these uses, but implementations can use them for other purposes if desired. All told, there are 66 of them out of over a million code points, so it's not a big cost to offer some future flexibility.
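The definition above is small enough to write down directly. This is a minimal sketch (the function name `is_noncharacter` is my own, not a standard library API) that also confirms the count of 66: two per plane across 17 planes, plus the 32 code points in the FDD0-FDEF range.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if the code point is one of Unicode's noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # U+FDD0..U+FDEF: a contiguous run of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: low 16 bits are FFFE or FFFF.
    return (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 66
```

Note that U+FFFD (REPLACEMENT CHARACTER) and U+FEFF are ordinary assigned characters, so `is_noncharacter` returns `False` for both.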

Rob Napier
  • 1
    FYI, _BOM_ is `U+FEFF` (Zero Width No-Break Space) which can appear almost anywhere in a text (and declares endianness only at the beginning of a file)… `U+FFFE` is nothing… – JosefZ Mar 29 '21 at 21:44
  • That's what I meant by "is technically a character, but its use as a character is deprecated." For details see "What should I do with U+FEFF in the middle of a file?" https://unicode.org/faq/utf_bom.html#bom6 While "nothing" is a reasonable way to think of U+FFFE, it is defined and is decodable. It just maps to "non-character." See https://util.unicode.org/UnicodeJsps/character.jsp?a=FFFE (specifically the property noncharacter_code_point). – Rob Napier Mar 29 '21 at 22:00
  • BTW, thanks for the FAQ link in the comments above. That's a really helpful page for this question. – Rob Napier Mar 29 '21 at 22:07