I am working on modifying the contents of an XML file generated by some other library. I'm making some DOM modifications with PHP (5.3.10) and reinserting a replacement node.

The XML data I'm working with has &quot; entities before I do the manipulation, and I want to keep those entities, as per http://www.w3.org/TR/REC-xml/, when I'm done with the modifications.

However, I'm having problems with PHP changing the &quot; entities. See my example.

$temp = 'Hello &quot;XML&quot;.';
$doc = new DOMDocument('1.0', 'utf-8');
$newelement = $doc->createElement('description', $temp);
$doc->appendChild($newelement);
echo $doc->saveXML() . PHP_EOL; // shows " instead of the &quot; entity
$node = $doc->getElementsByTagName('description')->item(0);
echo $node->nodeValue . PHP_EOL; // also shows "

Output

<?xml version="1.0" encoding="utf-8"?> 
<description>Hello "XML".</description>

Hello "XML".

Is this a PHP error or am I doing something wrong? I hope it isn't necessary to use createEntityReference in every char location.

Similar Question: PHP XML Entity Encoding issue


EDIT: Here is an example showing that saveXML() converts the &quot; entities but leaves &amp; alone, which is the behavior I expect for both. This $temp string should be output by saveXML() exactly as it was entered, entities included.

$temp = 'Hello &quot;XML&quot; &amp;.';
$doc = new DOMDocument('1.0', 'utf-8');
$newelement = $doc->createElement('description', $temp);
$doc->appendChild($newelement);
echo $doc->saveXML() . PHP_EOL; // shows " instead of the &quot; entity, while &amp; is kept
$node = $doc->getElementsByTagName('description')->item(0);
echo $node->nodeValue . PHP_EOL; // also shows " &

Output

<?xml version="1.0" encoding="utf-8"?>
<description>Hello "XML" &amp;.</description>

Hello "XML" &.
user6972
  • [Maybe this is of some use?](http://stackoverflow.com/questions/17321770/dom-in-php-decoded-entities-and-setting-nodevalue) Interesting - I created a `new DOMText($temp);` as a text node, then appended that to `$newelement` (an empty `<description>` node), and the result I got was _almost_ right: `Hello &quot;XML&quot;.` – Michael Berkowski Feb 08 '15 at 21:48
  • @MichaelBerkowski That is interesting. If you used my string $temp, which was already encoded, then your method double-encoded it, but it did keep the encoding properly during saveXML. Can you describe more about what you're doing? I get an 'Invalid Character Error' when I try the DOMText. – user6972 Feb 09 '15 at 04:21
  • I don't see what's wrong with having double quotes unencoded in an element's node value? They get escaped only when inside attribute values. – Ja͢ck Feb 09 '15 at 04:21
  • @Ja͢ck the XML spec is for double quotes to be encoded inside any text node. – user6972 Feb 09 '15 at 04:27
  • Well, the spec only mentions `&` and `<` to require escaping in the contents; escaping of single and double quotes is only applicable in attributes. – Ja͢ck Feb 09 '15 at 04:30
  • possible duplicate of [DOMDocument::createElement(): unterminated entity reference](http://stackoverflow.com/questions/22956330/domdocumentcreateelement-unterminated-entity-reference) – ThW Feb 09 '15 at 14:47

1 Answer

The answer is that it doesn't actually need any escaping according to the spec (skipping the mentions of CDATA):

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form (...) If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; " (...)

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".

You can verify this easily by using createTextNode() to perform the correct escaping:

$dom = new DOMDocument;
$e = $dom->createElement('description');
$content = 'single quote: \', double quote: ", opening tag: <, ampersand: &, closing tag: >';
$t = $dom->createTextNode($content);
$e->appendChild($t);
$dom->appendChild($e);

echo $dom->saveXML();

Output:

<?xml version="1.0"?>
<description>single quote: ', double quote: ", opening tag: &lt;, ampersand: &amp;, closing tag: &gt;</description>
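
As a complementary check, here is a minimal sketch of the other half of the quoted passage: quotes are expected to be escaped when they appear inside an attribute value, and a parser treats `&quot;` and a literal `"` in element content as the same character. The `title` attribute and the variable names are made up for illustration and do not come from the question.

```php
<?php
// Same string placed once in an attribute value and once in a text node.
$dom = new DOMDocument('1.0', 'utf-8');
$e = $dom->createElement('description');
$e->setAttribute('title', 'say "hi"');             // hypothetical attribute
$e->appendChild($dom->createTextNode('say "hi"'));
$dom->appendChild($e);

// Serialize only the element; the quotes in the attribute value are
// escaped, while the quotes in the text node are left as they are.
$xml = $dom->saveXML($e);
echo $xml . PHP_EOL;

// Parse two documents that differ only in how the quotes are written.
$withEntity = new DOMDocument();
$withEntity->loadXML('<description>Hello &quot;XML&quot;.</description>');

$withLiteral = new DOMDocument();
$withLiteral->loadXML('<description>Hello "XML".</description>');

$fromEntity  = $withEntity->documentElement->textContent;
$fromLiteral = $withLiteral->documentElement->textContent;

// Both parse to the same text, so a conforming XML consumer
// cannot tell the two spellings apart.
var_dump($fromEntity === $fromLiteral); // bool(true)
```

If a downstream consumer only accepts the `&quot;` spelling in element content, that would suggest it is not parsing the input as XML, which is worth checking before trying to force the entity into the output.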
Ja͢ck
  • This is interesting, because I assumed that the XML internals in WordPress were encoding correctly, and I was trying to duplicate their XML when returning it for processing by their code. I've run into bugs with PHP and entities before, so I assumed something was amiss there and that WordPress had it right. I will have to look into their code to see how they are producing these entities in their XML. For whatever reason, it is causing me issues with how WP parses the XML I've modified. – user6972 Feb 09 '15 at 07:10
  • One quick question. When you mention encoding is 'only applicable in attributes', does this include embedded HTML attributes? For example, if the text node has HTML tags with attributes in it? – user6972 Feb 09 '15 at 07:12
  • A text node can't have html tags, and as such the opening tag must be escaped. – Ja͢ck Feb 09 '15 at 07:13