2

My customer wants my XML file to contain <name>Smith & Jones</name> rather than <name>Smith &amp; Jones</name>.

I can't find a quality reference discussing this.

parsifal
David Thielen
  • If you know the answer, then **add** the answer! – Naftali Aug 27 '12 at 15:27
  • No thanks to the one who edited this question; the fact that we could not see any problem in it was a solution in itself ;). More seriously, see http://stackoverflow.com/questions/730133/invalid-characters-in-xml to learn that this is not possible. – jolivier Aug 27 '12 at 15:27
  • @Neal As listed in the part of the question that was edited away, we have a customer asking this question who wants a third-party answer. So answering it myself would not have been accepted. – David Thielen Aug 27 '12 at 16:00

4 Answers

7

From the XML specification (§2.4):

The ampersand character (&) and the left angle bracket (<) may appear in their literal form only when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. They are also legal within the literal entity value of an internal entity declaration; see "4.3.2 Well-Formed Parsed Entities". If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.

Since this circumstance fits into none of the stated categories, it is illegal.
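As a quick way to see this rule in practice, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The helper name `is_well_formed` and the sample strings are made up for illustration; other parsers will word their errors differently, but a conforming one should reject the first document too.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for well-formedness violations such as a bare ampersand.
        return False


raw = "<name>Smith & Jones</name>"          # literal ampersand in content
escaped = "<name>Smith &amp; Jones</name>"  # predefined entity reference
numeric = "<name>Smith &#38; Jones</name>"  # numeric character reference

print(is_well_formed(raw))          # False
print(is_well_formed(escaped))      # True
print(ET.fromstring(escaped).text)  # Smith & Jones
print(ET.fromstring(numeric).text)  # Smith & Jones
```

Both escaped forms give the application the same text, `Smith & Jones`; the escaping exists only in the serialized file.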

lonesomeday
  • What does the XML spec say that parsers should do if they encounter this? I can't work it out. Is it unspecified, i.e. parsers can do whatever they like? – Rich Sep 29 '16 at 10:56
  • 2
    @Rich this would not be a [well-formed document](https://www.w3.org/TR/REC-xml/#sec-well-formed). "[Violations of well-formedness constraints are fatal errors.](https://www.w3.org/TR/REC-xml/#sec-terminology)"; "Definition: An error which a conforming XML processor MUST detect and report to the application." – Joe Aug 14 '20 at 11:03
6

Use a CDATA section to include these characters in element content without the XML parser interpreting them as markup:

<name>Smith & Jones</name>

becomes

<name><![CDATA[ Smith & Jones ]]></name>

This way you can also put plain HTML within XML.

example: http://www.w3schools.com/xml/xml_cdata.asp
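To illustrate what a parser hands back for a CDATA section, here is a small Python sketch using the standard library's `xml.etree.ElementTree`; the variable names are made up for the example. It also shows that any spaces placed inside the CDATA delimiters become part of the element's text.

```python
import xml.etree.ElementTree as ET

cdata_doc = "<name><![CDATA[Smith & Jones]]></name>"
escaped_doc = "<name>Smith &amp; Jones</name>"
padded_doc = "<name><![CDATA[ Smith & Jones ]]></name>"

cdata_text = ET.fromstring(cdata_doc).text
escaped_text = ET.fromstring(escaped_doc).text
padded_text = ET.fromstring(padded_doc).text

# The CDATA form and the escaped form carry the same character data.
print(cdata_text == escaped_text)  # True

# Whitespace inside the CDATA delimiters is content, not formatting.
print(repr(padded_text))           # ' Smith & Jones '
```

So the CDATA form changes how the file looks, not what an application reading it sees.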

Maurice
3

You can't, at least if you want to keep calling your file "XML". XML does not allow unescaped ampersands in element content, and any conforming parser will reject a file containing them as not well-formed.

You can use CDATA, but that introduces its own ugliness, and most serializers don't generate CDATA by default.
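As one illustration of that default, here is a minimal Python sketch with the standard library: `xml.etree.ElementTree` escapes the ampersand when serializing, while `xml.dom.minidom` emits CDATA only if you explicitly build a CDATA node. The variable names are made up for the example, and other languages' serializers will have their own switches.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Default behaviour: the serializer escapes the ampersand for you.
elem = ET.Element("name")
elem.text = "Smith & Jones"
default_output = ET.tostring(elem, encoding="unicode")
print(default_output)  # <name>Smith &amp; Jones</name>

# Opting in to CDATA means constructing the node explicitly.
doc = minidom.Document()
name = doc.createElement("name")
name.appendChild(doc.createCDATASection("Smith & Jones"))
doc.appendChild(name)
cdata_output = name.toxml()
print(cdata_output)    # <name><![CDATA[Smith & Jones]]></name>
```

Either output is well-formed; neither produces the bare `&` the customer asked for.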

parsifal
  • Note to the people who upvoted me: thanks, but the OP's original post asked for a reference; it was over-edited by Neal. In the original context, *lonesomeday* has the correct answer. – parsifal Aug 27 '12 at 15:57
2

The XML specification is clear that this is not well-formed XML.

If you want to know WHY the spec was written that way, that's always a much harder question to answer. Sometimes (but not this time) Tim Bray's annotated version of the XML recommendation at http://www.xml.com/axml/testaxml.htm sheds some light. Sometimes (but not this time) the comments and other notes in the XML source of the specification at http://www.w3.org/TR/1998/REC-xml-19980210.xml are revealing. In the absence of such clues, it is useful to recall that the creators of XML were very anxious to preserve compatibility with SGML, and that they were generally disposed towards having parsers that could detect errors in the XML rather than making XML easy to author.

Michael Kay