
I am given an XML file which contains names like below:

<Benchↂ0020Codeↂ0020>something</Benchↂ0020Codeↂ0020>

The ↂ symbol is represented with three bytes: 0xE2, 0x86, 0x82.

It looks like ↂ0020 is supposed to be treated as a space character. But when I read the XML using System.Xml.XmlReader, the characters ↂ0020 are not converted to a space.

Is there a way to have them converted (besides replacing, of course)? Or did I just get broken XML?

Bobrovsky
  • https://stackoverflow.com/questions/3480887/how-to-include-space-in-xml-tag-element-which-gets-transformed-by-xslt-into-exce – L.B Oct 08 '17 at 17:58
  • According to the [XML specification](https://www.w3.org/TR/xml/#NT-NameStartChar) the character U+2182 is allowed, so the XML code looks valid. But it's a weird name for an XML element. Check the source of your XML code to see whether it was generated this way or whether you changed it afterwards somehow. – Progman Oct 08 '17 at 17:59
  • If they were spaces then that would be an invalid tag name. The XML isn't necessarily broken, as `Benchↂ0020Codeↂ0020` is a valid tag name. Why do you think they should be spaces? – user657267 Oct 08 '17 at 20:08
  • Looks like the XML isn't broken, but it's representing names using a private convention for escaping disallowed characters. The XML parser won't understand this convention; it's up to the receiving application to interpret it. – Michael Kay Oct 08 '17 at 21:33
  • @MichaelKay Please post your comment as the answer so I can accept it. – Bobrovsky Oct 09 '17 at 11:48
  • @Bobrovsky: Is there a reason why you ask for a comment to be posted as an answer rather than accept [my existing, extensive answer](https://stackoverflow.com/a/46635785/290085), which thoroughly explains that ↂ is not a space, that it is allowed, and that spaces are not allowed in XML names? If I've offended you with my attempt to improve the searchability of your title and tagging for future readers, I apologize. – kjhughes Oct 09 '17 at 13:02
  • @kjhughes I think you misunderstood my question completely (and this is why your change to the title is something I think is not applicable). Your answer is a wiki-like answer for some other question. My question is about how to deal with the XML I am given. And about if it's possible at all. – Bobrovsky Oct 09 '17 at 13:52
  • @Bobrovsky: I fully understood and answered your question: You ask (1) *Is there is a way to have them converted (besides replacing, of course)?* and (2) *Or I just got broken XML?* My answer says (1) you cannot convert them because **Space characters are not permitted in XML names** and (2) **ↂ is permitted in XML names** so your XML is not broken. I both answer at a high level and provide low level details to support the answer. – kjhughes Oct 09 '17 at 14:16

2 Answers


Space characters are not permitted in XML names

There are 86 code points whose names contain the word SPACE. Ignoring those that match only because of MONOSPACE, and any others that have a visual representation, leaves the following:

  • #x0020 SPACE
  • #x00A0 NO-BREAK SPACE
  • [#x2002-#x200A] EN SPACE through HAIR SPACE
  • #x205F MEDIUM MATHEMATICAL SPACE
  • #x3000 IDEOGRAPHIC SPACE

None of the space-related code points (empty visual representation) are permitted in XML names by the W3C XML BNF for component names:

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] |
                  [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
                  [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] |
                  [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
                  [#x10000-#xEFFFF]
NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] |
                  [#x203F-#x2040]
Name          ::= NameStartChar (NameChar)*
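
As a rough illustration of how the productions above can be checked mechanically, here is a minimal Python sketch (Python is used purely for illustration; the question itself concerns .NET). The ranges are transcribed from the BNF above, and the helper names `is_name_start_char`, `is_name_char`, and `is_xml_name` are hypothetical, not part of any library.

```python
# Code point ranges transcribed from the NameStartChar production above.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Extra characters that NameChar allows on top of NameStartChar:
# "-", ".", [0-9], #xB7, [#x0300-#x036F], [#x203F-#x2040].
NAME_EXTRA_RANGES = [
    (0x2D, 0x2D), (0x2E, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_xml_name(s):
    """Name ::= NameStartChar (NameChar)*"""
    return (
        len(s) > 0
        and is_name_start_char(s[0])
        and all(is_name_char(c) for c in s[1:])
    )
```

Under these definitions, `is_xml_name("Bench\u21820020Code\u21820020")` holds, while `is_xml_name("Bench Code")` does not, because #x0020 falls in none of the listed ranges.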

Alternatives to spaces in XML names

  • CamelCase
  • underscore_char
  • hyphen-char
  • period.char

Colon should not be used as a word separator in XML names to avoid confusion with its use in XML Namespaces.


ↂ is permitted in XML names

The character ↂ (bytes 0xE2, 0x86, 0x82 in UTF-8, which is code point #x2182) has nothing to do with spaces; it is ROMAN NUMERAL TEN THOUSAND. ↂ is explicitly permitted: #x2182 is in the [#x2070-#x218F] code range.

The characters 0020 appearing after ↂ are just digits. Together with the rest of the characters in Benchↂ0020Codeↂ0020, these form an allowed (albeit unconventional) XML name. They do not constitute spaces in the XML name, as spaces are not allowed in XML names.
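
The byte-to-code-point claim can be double-checked with a few lines of Python (again only an illustration, using the standard `unicodedata` module; the variable names are arbitrary):

```python
import unicodedata

# The three bytes quoted in the question, decoded as UTF-8.
raw = bytes([0xE2, 0x86, 0x82])
ch = raw.decode("utf-8")

code_point = ord(ch)             # numeric value of the decoded character
name = unicodedata.name(ch)      # official Unicode character name
is_space = ch.isspace()          # whether Python treats it as whitespace

# Membership in the [#x2070-#x218F] range of NameStartChar.
in_allowed_range = 0x2070 <= code_point <= 0x218F
```

Here `code_point` comes out as 0x2182, `name` as ROMAN NUMERAL TEN THOUSAND, `is_space` as false, and `in_allowed_range` as true.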

kjhughes

The XML isn't broken, but it's representing names using a private convention for escaping disallowed characters. The XML parser won't understand this convention; it's up to the receiving application to interpret it.
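
What "interpreting it in the receiving application" might look like is sketched below in Python, purely as an illustration. It rests on an assumption that the source does not confirm: that the producer's convention is ↂ (U+2182) followed by exactly four hexadecimal digits naming the escaped code point. The name `decode_name` is hypothetical; confirm the actual rule with whoever generates the file before relying on anything like this.

```python
import re

# ASSUMPTION: the private convention is U+2182 followed by exactly four
# hex digits that give the code point of the escaped character.
ESCAPE = re.compile("\u2182([0-9A-Fa-f]{4})")


def decode_name(name):
    """Replace each assumed escape sequence with the character it encodes."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


# The element name from the question, decoded after parsing.
label = decode_name("Bench\u21820020Code\u21820020")
```

With this assumption, `label` becomes "Bench Code " (including the trailing space). The decoding is applied to names after the parser has read them, so the XML itself stays untouched and well-formed.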

Michael Kay