
U+FFFE needs to be a noncharacter in order to allow the Byte Order Mark to work.

U+FFFF is described in The Unicode Standard as "useful for internal purposes as sentinels". Makes sense.

But I can't figure out, and The Unicode Standard doesn't really explain, why the set of noncharacters includes some random block within "Arabic Presentation Forms-A". What are these for? (Besides the eye of the basilisk?)

dan04

2 Answers


OK the question is "what are they for" and "Why are they in the middle of the Arabic Presentation Forms".

Therefore it was agreed that these codepoints, which were never going to be used otherwise, would be designated noncharacters so they could be used internally by applications/programmers.

Ben

These noncharacters are for internal use by applications and should not be interchanged.

I have tried to explain this based on what the Unicode Standard says.

Unicode has 66 noncharacters. Each of the 17 planes has two: its last two code points, which end in FFFE and FFFF. The other 32 noncharacters form a contiguous block, U+FDD0 to U+FDEF.

So total count

 17*2 + 32 = 66
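
The count can be checked mechanically. Here is a minimal Python sketch; the helper name `is_noncharacter` is my own and not taken from any library:

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The contiguous block inside Arabic Presentation Forms-A.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each plane: low 16 bits are FFFE or FFFF.
    return (cp & 0xFFFE) == 0xFFFE


# Enumerate every code point in all 17 planes and keep the noncharacters.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 66
```

The mask `0xFFFE` ignores the lowest bit, so a single comparison covers both the FFFE and FFFF endings in every plane.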

The following text from chapter 16 of the Unicode Standard says that the block sits in that seemingly random place for "historical reasons". I'm curious about those reasons too, but I don't think the text itself is ambiguous:

For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not "Arabic noncharacters" or "right-to-left noncharacters," and are not distinguished in any other way from the other noncharacters, except in their code point values

U+FEFF is the BOM, and U+FFFE is its byte-swapped counterpart. Because U+FFFE is a noncharacter, an interpreting process that finds U+FFFE as the first character gets a signal that either the text is in the opposite byte order or the file is not valid Unicode text. It is only a signal, not a standardized detection mechanism: the cause could be either swapped bytes or invalid text.

Section 3.2 of the Unicode Standard, clause C2, says:
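
As an illustration, a reader of UTF-16 data might treat the first 16-bit unit as a hint like this. This is a rough sketch, assuming the data is meant to be UTF-16 at all; `guess_utf16_byte_order` is a hypothetical helper, not a standard function:

```python
from typing import Optional


def guess_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess the byte order of UTF-16 data from its first 16-bit unit.

    Returns "big", "little", or None when there is no usable hint.
    """
    if len(data) < 2:
        return None
    # Read the first unit as if the data were big-endian.
    first_unit = int.from_bytes(data[:2], "big")
    if first_unit == 0xFEFF:
        return "big"  # A BOM in the order we assumed.
    if first_unit == 0xFFFE:
        # Read this way we got the noncharacter U+FFFE, so the bytes are
        # probably swapped. It is only a hint: the data may not be Unicode.
        return "little"
    return None


print(guess_utf16_byte_order(b"\xff\xfeh\x00i\x00"))
```

The result is a guess, which matches the wording of the standard: U+FFFE at the start is a strong signal, not a guarantee.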

C2 A process shall not interpret a noncharacter code point as an abstract character.

  • The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.

So as an application developer you are free to use these code points internally as you wish. They can serve as sentinels, delimiters, or maybe some basilisk characters, but they should not be interchanged.
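
For example, an application could use one of them as an in-memory field delimiter and make sure it never leaks out. This is a minimal sketch; the choice of U+FDD0 and the names `FIELD_SEP`, `pack`, and `unpack` are arbitrary assumptions of mine:

```python
# Internal delimiter: a noncharacter that well-formed interchange text
# should never contain. U+FDD0 is an arbitrary pick for this sketch.
FIELD_SEP = "\uFDD0"


def pack(fields):
    """Join fields into one internal string, rejecting unsafe input."""
    for field in fields:
        if FIELD_SEP in field:
            raise ValueError("input already contains the internal delimiter")
    return FIELD_SEP.join(fields)


def unpack(packed):
    """Split an internal string back into its fields."""
    return packed.split(FIELD_SEP)


record = pack(["a,b", "c d", ""])
print(unpack(record))  # ['a,b', 'c d', '']
```

The packed string stays inside the process; anything sent to other systems should be built from the unpacked fields, so the noncharacter is never exchanged.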

Section 16.7 says

In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses

Again, the Unicode Standard does not reserve U+FFFF as a sentinel; it only describes that as a typical use case. From section 16.7:

U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.
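
The "higher than any valid character value" use can be sketched with a prefix lookup in a sorted list. This assumes the stored strings contain no noncharacters themselves and relies on Python comparing strings by code point; `HIGH` and `with_prefix` are illustrative names:

```python
import bisect

# U+10FFFF sorts after every valid character, so prefix + HIGH is an
# upper bound for all real strings that start with prefix.
HIGH = "\U0010FFFF"


def with_prefix(sorted_words, prefix):
    """Return the slice of sorted_words whose entries start with prefix."""
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + HIGH)
    return sorted_words[lo:hi]


words = sorted(["apple", "apply", "apt", "banana", "ápice"])
print(with_prefix(words, "app"))  # ['apple', 'apply']
```

The sentinel exists only in the search key and is never stored or written out, which keeps it within the internal use the standard allows.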

一二三
Zimbabao
  • No, I didn't say that U+FFFE is the BOM. I said that U+FFFE's noncharacter status is necessary to allow U+FEFF to be the BOM. And you haven't answered my question at all, but just copy/pasted the very same link I included. – dan04 Mar 05 '11 at 06:35
  • OK, sorry, I got you wrong on the BOM part; I'll edit that. But the text says this is an intended peculiarity, not something the standard mandates. It says U+FFFE can be used as a "strong signal", not as a sure-fire way of determining byte order. And it says the other code points are reserved for applications to decide how to use them. – Zimbabao Mar 05 '11 at 06:50
  • Well, I can see uses for internal sentinel noncharacters. But Unicode 3.0 already had 34 of them. Why the need to add another 32 in 3.1? – dan04 Mar 05 '11 at 07:19
  • OK. If I understand correctly, your question is: what is the "historical reason" for reserving 32 new noncharacters? No explanation can be found in the Unicode spec. My guess is that they felt a need for more noncharacters in the BMP, since the BMP is the most used plane, had only two of them, and both already had some use. One such example I can think of is Java applications. – Zimbabao Mar 05 '11 at 08:20