What encoding should the XML prologue itself be in (and why)?

For example, should

<?xml version="1.0" encoding="big5" ?>

itself be encoded in Big5?


Question inspired by "How to parse non-UTF8 XML in browsers with Javascript?", where the poster has the XML prologue/declaration encoded in Big5.

Michal Charemza

2 Answers

It is not possible to encode ASCII in Big5.

Big5 is purely a double-byte character set. To allow intermixing with a single-byte character set, the lead byte of every Big5 two-byte character has the high-order bit set. The standard never specified which SBCS was to be used; the de facto choice is ASCII, which can be distinguished unambiguously because every ASCII character has the high-order bit clear.

Put another way, Big5 contains no 2-byte encodings corresponding to the standard ASCII character set, so the only way to include an XML prologue and tag delimiters is to use ASCII characters.
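As a quick check of the point above, here is a minimal Python sketch. It assumes Python's built-in `big5` codec, which follows the de facto convention of passing ASCII through as single bytes; the variable names are illustrative only. It compares the bytes of the declaration under both encodings and inspects the lead byte of one Chinese character.

```python
# The declaration from the question, as a Python string.
DECLARATION = '<?xml version="1.0" encoding="big5" ?>'

# Assumption: Python's "big5" codec emits ASCII characters as single bytes.
as_big5 = DECLARATION.encode("big5")
as_ascii = DECLARATION.encode("ascii")
declaration_is_plain_ascii = as_big5 == as_ascii

# A real Big5 character is two bytes long, and its lead byte has the
# high-order bit set, so it cannot be mistaken for an ASCII byte.
encoded_char = "中".encode("big5")
lead_byte_has_high_bit = bool(encoded_char[0] & 0x80)

print(declaration_is_plain_ascii, lead_byte_has_high_bit)
```

If the codec behaves as assumed, the declaration's bytes are the same whether the file is labelled ASCII or Big5.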

Jim Garrison

The XML declaration must be in the same encoding as the rest of the document. If the document is in Big5 the XML declaration must be in Big5.

What this means for an XML parser is that it must have a list of supported encodings and must try them in turn until it finds one where the result of decoding the first 20 or 30 bytes in the file is a valid XML declaration with the right encoding label.

Of course this strategy can be optimized: if the first few bytes of the file are <?xml in ASCII, then this reduces the set of possibilities.
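The trial-decoding strategy described above might be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not how any particular parser works: the function name `sniff_encoding`, the candidate list, the 80-byte window, and the simplified declaration regex are all hypothetical choices, and UTF-16 is left out to keep the sketch short.

```python
import codecs
import re

# Simplified pattern for an XML declaration that carries an encoding label.
DECL = re.compile(
    r'^<\?xml\s+version\s*=\s*(["\'])1\.[0-9]+\1'
    r'\s+encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\2'
)


def sniff_encoding(data, candidates=("utf-8", "big5", "cp037")):
    """Return the first candidate whose decoding of the start of the file
    yields an XML declaration naming that same encoding, else None."""
    head = data[:80]  # the declaration sits at the very start of the file
    for candidate in candidates:
        # errors="replace" tolerates a multi-byte character cut at byte 80.
        text = head.decode(candidate, errors="replace")
        match = DECL.match(text)
        if match is None:
            continue  # not a readable declaration under this encoding
        try:
            declared = codecs.lookup(match.group(3)).name
        except LookupError:
            continue  # label is not an encoding Python knows about
        if declared == codecs.lookup(candidate).name:
            return candidate
    return None


big5_doc = '<?xml version="1.0" encoding="big5" ?><r>中文</r>'.encode("big5")
ebcdic_doc = '<?xml version="1.0" encoding="cp037"?><r/>'.encode("cp037")
print(sniff_encoding(big5_doc), sniff_encoding(ebcdic_doc))
```

Note that the Big5 document also decodes cleanly as UTF-8 up to the end of the declaration; it is the label check that rejects UTF-8 and accepts Big5. The EBCDIC document (`cp037`) shows the case where the first bytes are not `<?xml` in ASCII at all.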

XML parsers aren't obliged to support any encodings other than UTF-8 and UTF-16; support for anything else, including Big5, is optional.

Michael Kay