3

According to the specification the characters [#x10000-#xEFFFF] are legal in XML names. However, the W3 validator says that this XML is not well-formed:

<?xml version="1.0"?>
<𐐀>value</𐐀>

(the element name is the single Unicode character #x10400). Some browsers, like Firefox, also complain about it (Chrome displays the XML, IE shows a blank page). Is this an error in the tools, or is the XML really not well-formed?

pkalinow
  • While I couldn't find the answer to your question, what I can say is that it really does not seem like a good idea to use such characters in XML names. Quoting the specification: "Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names." XML names are meant to be natural-language words, so why would you use this? – Azaghal Aug 12 '16 at 15:10
  • "𐐀" is just an example. It's a letter from the Deseret alphabet, so someone could make meaningful names from such letters. I don't know if anyone really uses that alphabet, but it is not impossible. – pkalinow Aug 16 '16 at 07:09

2 Answers

2

Is this an error in the tools, or is the XML really not well-formed?

It's well formed in the latest specification, which is XML 1.0 Fifth Edition. But it was not well-formed in the previous edition, which was current until 2008.

The original XML 1.0 spec (from 1998) locked down the set of name characters to the characters that were defined as letters in the Unicode standard of the time. That didn't include 𐐀 (U+10400), which only came in with Unicode 3.1 a few years later.

XML 1.1 was much looser about what characters it would accept in names (largely for this reason, to allow characters from future Unicode versions), and this is a Good Thing. However XML 1.1 has never really caught on, so the Editors decided to backport the newer, more permissive namechar rules from there to 1.0. This was controversial and all in all probably not a Good Thing.

This means you can use 𐐀 in names in XML 1.0 documents, which will be usable by the subset of parsers that have been updated for the Fifth Edition (or never implemented the strict rules in the first place), or you can use it in XML 1.1 documents, which will be usable by a different set of parsers that support XML 1.1.

Or, more realistically, you can avoid those characters which are sort-of-well-formed-depending altogether, and feel a little sad.
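To see why the Fifth Edition accepts this name, here is a minimal sketch that transcribes the NameStartChar and NameChar ranges from the Fifth Edition grammar and tests a string against them. The function and constant names are made up for illustration, and this only models the Fifth Edition rules; the stricter Letter-based tables of the earlier editions are not reproduced here.

```python
# NameStartChar ranges, transcribed from the XML 1.0 Fifth Edition grammar.
NAME_START_RANGES = [
    (0x3A, 0x3A),  # ":"
    (0x41, 0x5A),  # A-Z
    (0x5F, 0x5F),  # "_"
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),  # supplementary planes, including U+10400
]

# Additional characters allowed in any position after the first (NameChar).
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),  # "-" and "."
    (0x30, 0x39),  # 0-9
    (0xB7, 0xB7),
    (0x300, 0x36F),
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls inside any (lo, hi) range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_fifth_edition_name(s):
    """Check s against the Fifth Edition Name production (hypothetical helper)."""
    if not s:
        return False
    return is_name_start_char(s[0]) and all(is_name_char(c) for c in s[1:])


if __name__ == "__main__":
    print(is_fifth_edition_name("\U00010400"))  # the Deseret letter from the question
    print(is_fifth_edition_name("1abc"))        # digits cannot start a name
```

A parser that still follows the older name rules would reject the same string, which is exactly the split described above.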

bobince
  • Good to know about such a difference between XML editions. It seems strange to me that one XML 1.0 may be incompatible with another XML 1.0... – pkalinow Aug 16 '16 at 14:22
1

Yes, supplementary characters are allowed in XML names.

Your XML is well-formed because the element name uses characters allowed by the Name production in the W3C XML Recommendation.

However:

  • Online validators that get the file from you over HTTP will have to take care to mind the character encoding. It appears that by the time the W3C Markup Validation Service gets your XML, your character is getting lost in an encoding shuffle:

    Warning Missing "charset" attribute for "text/xml" document.

    The HTTP Content-Type header (text/xml) sent by your web browser (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36) did not contain a "charset" parameter, but the Content-Type was one of the XML text/* sub-types.

    The relevant specification (RFC 3023) specifies a strong default of "us-ascii" for such documents so we will use this value regardless of any encoding you may have indicated elsewhere.

    If you would like to use a different encoding, you should arrange to have your browser send this new encoding information.

    Try an offline XML parser. My Xerces-J-based validator, for example, correctly identifies your XML as being well-formed.

  • Be aware that not all characters allowed by Name are allowed in an NCName. So, although well-formed, XML using such characters cannot be valid according to an XSD where such names are not allowed.
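The encoding point and the "try an offline parser" advice can both be checked locally. The following is a rough sketch, not the validator mentioned above: it uses Python's standard `xml.etree.ElementTree`, and the helper names `survives_ascii` and `parses_locally` are invented for this example. It shows that the UTF-8 bytes of the document cannot be read as us-ascii, and it reports whether the local parser accepts the document. The second result depends on which edition's name rules the underlying parser implements, so a rejection here does not by itself mean the document is malformed under the Fifth Edition.

```python
import xml.etree.ElementTree as ET

# The document from the question, with the element name written as an escape
# so the source file itself stays ASCII-only.
DOC = '<?xml version="1.0"?>\n<\U00010400>value</\U00010400>'


def survives_ascii(data: bytes) -> bool:
    """Return True if the bytes can be decoded as us-ascii without error."""
    try:
        data.decode("us-ascii")
        return True
    except UnicodeDecodeError:
        return False


def parses_locally(data: bytes) -> bool:
    """Return True if the local ElementTree parser accepts the bytes."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


utf8_bytes = DOC.encode("utf-8")

if __name__ == "__main__":
    # U+10400 takes four bytes in UTF-8, none of which are ASCII.
    print("decodes as us-ascii:", survives_ascii(utf8_bytes))
    # Outcome depends on the name rules of the parser bundled with Python.
    print("accepted by this parser:", parses_locally(utf8_bytes))
```

If the service forces us-ascii as its warning says, the four UTF-8 bytes of the element name cannot survive, regardless of which edition of the name rules the validator implements.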

kjhughes