Understanding Unicode: Surrogate Blocks, Noncharacters

Question

I am trying to actually understand the Unicode standard and was poking through the XML spec, where it reads:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Now I have a couple of questions:

What are the surrogate blocks? Are they the UTF-16 codes that indicate a 4 byte code point?
Does #xXXXX refer to the code point or to the UTF-16 encoded value here?
If it refers to the code point and my understanding of the surrogate blocks is correct: Why are the surrogate blocks mentioned here? Isn't it the task of an encoding to hide those encoding-related details from the space the encoding maps from?
Why are non-characters like "U+FFFE" defined as part of the unicode standard? As to my understanding, Byte-order detection (as well as handling flexible sized code words) is up to the encoding.

Thanks for clarification!

Are you asking about the Unicode Standard or the W3C's XML Specification? — 一二三, Apr 30 '16 at 11:29
About the Unicode Standard in the context of the XML Specification ;) The 2nd question refers to the notation used in the XML Specification, however I want to understand the role of Unicode here. So far I thought that Unicode describes the set of all known symbols (and gives them a number) and that encodings like UTF-8 describe a mapping from a Unicode character stream to a byte stream (and vice versa). But then I read this XML spec that confused me. — Henning, Apr 30 '16 at 11:41
You're more likely to get answers if you only ask a single question. — nwellnhof, Apr 30 '16 at 13:00
Hmm, ok, but all these questions are highly related. @LưuVĩnhPhúc: The article you mentioned states: > The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme. Thus, the xml spec is using UTF-16 encoded values to describe xml? Otherwise, mentioning surrogate blocks does not make much sense. Why are they doing that? — Henning, Apr 30 '16 at 13:08

score 8 · Accepted Answer · answered May 03 '16 at 21:01

What are the surrogate blocks?

Unicode codepoints in the U+D800 to U+DFFF range, inclusive, which are reserved for exclusive use as UTF-16 surrogates and are illegal in any other context.

Are they the UTF-16 codes that indicate a 4 byte code point?

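Yes. A codepoint above U+FFFF is encoded in UTF-16 as a pair of surrogates, a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF), which takes 4 bytes in total.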
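
To make the pairing concrete, here is a minimal Python sketch of the standard UTF-16 surrogate arithmetic. The function name to_surrogate_pair is made up for illustration; the result is cross-checked against Python's built-in "utf-16-be" codec.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a codepoint in U+10000..U+10FFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF are encoded with surrogates")
    offset = cp - 0x10000              # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits    -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert "\U0001F600".encode("utf-16-be") == expected
```

Note that the two halves only ever show up in the encoded UTF-16 form; the codepoint itself (U+1F600 here) is what the text actually contains.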

Does #xXXXX refer to the code point or to the UTF-16 encoded value here?

The actual Unicode codepoints. The definition of Char includes values above #xFFFF, which a single UTF-16 code unit cannot represent, so it cannot be describing UTF-16 encoded values. UTFs are byte encoding schemes for codepoint values. The XML spec is written in terms of codepoints, not encodings. An XML document can be encoded in any charset specified in the "encoding" attribute of the XML prolog, for purposes of storage and transmission, but the actual XML content is processed in terms of unencoded codepoints.
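
As a rough illustration, the Char production quoted in the question can be written as a check on integer codepoints, with no encoding involved at all. This is only a sketch; is_xml_char is a name chosen here, not something from the XML spec or any library.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if codepoint cp matches the Char production quoted in the question."""
    return (
        cp in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF          # stops just before the surrogates
        or 0xE000 <= cp <= 0xFFFD        # resumes after them, stops before FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF     # supplementary planes
    )


for cp in (0x41, 0xD800, 0xFFFE, 0xFFFF, 0x1F600):
    print(f"U+{cp:04X}: {is_xml_char(cp)}")
```

The same check applies whether the document was stored as UTF-8, UTF-16, or something else, because it runs on the decoded codepoints.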

If it refers to the code point and my understanding of the surrogate blocks is correct: Why are the surrogate blocks mentioned here?

The surrogate codepoints are reserved and not allowed to appear unencoded in any textual content. The Char definition is simply enforcing that rule.
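
You can see the same rule enforced elsewhere. As a small sketch, Python's strict UTF-8 codec refuses a lone surrogate in both directions (the two helper names below are made up for this example):

```python
def can_encode_utf8(text: str) -> bool:
    """Return True if text encodes under Python's strict UTF-8 codec."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def can_decode_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(can_encode_utf8("\ud800"))          # lone high surrogate U+D800
print(can_decode_utf8(b"\xed\xa0\x80"))   # the bytes U+D800 would take in UTF-8
print(can_encode_utf8("\U0001F600"))      # a real supplementary character
```

The first two calls fail and the third succeeds: a codepoint above U+FFFF is fine, but a surrogate on its own is not text.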

Why are non-characters like "U+FFFE" defined as part of the unicode standard? As to my understanding, Byte-order detection (as well as handling flexible sized code words) is up to the encoding.

Because the encoding is not always known ahead of time, and may have to be detected dynamically. The BOM itself is U+FEFF; U+FFFE is its byte-swapped counterpart. Since U+FFFE is guaranteed never to be a character, a decoder that reads U+FFFE at the start of a stream knows it has the byte order backwards. Early versions of Unicode allowed U+FEFF to be used as either a BOM or an actual zero-width non-breaking space character within textual content. That led to ambiguity at times. So newer versions of Unicode recommend U+FEFF as a BOM only, and non-breaking spacing is handled by U+2060 WORD JOINER instead to avoid any ambiguity.

That being said, in the context of XML, it doesn't make sense to use U+FFFE in any textual content. It is not a character, and seeing it usually means the byte order was detected incorrectly. The entire document is encoded in a particular charset, and any BOM used would have to appear before the XML prolog. The XML spec defines BOM handling and charset detection outside of the XML document itself. So that is why the Char definition excludes U+FFFE.
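
Here is a minimal sketch of why U+FFFE is useful as a noncharacter. It assumes the input is UTF-16 and only looks at the first two bytes; sniff_utf16_byte_order is a hypothetical helper, and a real detector would also have to consider UTF-8 and UTF-32 (whose little-endian BOM starts with the same two bytes).

```python
import codecs


def sniff_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return "unknown"


# U+FEFF (the BOM) followed by text, written in little-endian order.
payload = "\ufeffhi".encode("utf-16-le")
print(sniff_utf16_byte_order(payload))  # little-endian

# Reading the same two bytes with the wrong byte order yields U+FFFE,
# which can never be a real character, so the mistake is detectable.
wrong = payload[:2].decode("utf-16-be")
print(f"U+{ord(wrong):04X}")  # U+FFFE
```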

U+FFFF is also a noncharacter: it is permanently reserved and is not intended to be used in real content. So that is why the Char definition excludes it.

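So basically the Char definition allows all Unicode codepoints minus restricted codepoints: the surrogates, U+FFFE, U+FFFF, and the control codes below #x20 other than tab, line feed, and carriage return.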

Thanks for pointing out that these noncharacters were defined to simplify the encoding handling! — Henning, May 04 '16 at 09:46
