
I am testing with the following characters: " & ' < > £. My code builds an XML file using PHP and DOMDocument.

<?php

 $xml = new DOMDocument();
 $xml->formatOutput = true;
 $root = $xml->createElement('Start_Of_XML');
 $xml->appendChild($root);

 $el = $xml->createElement($node,htmlspecialchars(html_entity_decode($value[$i],ENT_QUOTES,'UTF-8'),ENT_QUOTES,'UTF-8'));               
 $parent->appendChild($el);

?>

With htmlspecialchars(), the test characters end up in the saved XML as, respectively:

" &amp; ' &lt; &gt; £

That is, the double quote, apostrophe and pound sign are not encoded.

If I adjust the code to use htmlentities() instead:

<?php
 $el = $xml->createElement($node,htmlentities(html_entity_decode($value[$i],ENT_QUOTES,'UTF-8'),ENT_QUOTES,'UTF-8'));

?>

The characters are then output as:

" &amp; ' &lt; &gt; &pound;

So the pound sign gets converted along with the rest, but again the quote and apostrophe fail to get encoded when the XML is saved.

After searching through several posts, I'm at a loss to find a solution.

Edit:

Using Gordon's answer as a basis I got the results I was looking for using something along the lines of https://3v4l.org/ZksrE

Great effort from ThW though. Seems pretty comprehensive. I'm going to accept this as a solution. Thanks.

cookie
  • It seems `createElement` is conveniently recognising `"` and `'` for you and converting them back to their original quotes: https://3v4l.org/qof5l – Nick Feb 14 '19 at 12:22
  • Great! How do I re-convert them please? – cookie Feb 14 '19 at 12:33
  • I tried using `createTextNode()` but that proves fruitless. https://3v4l.org/WMBfW Are you able to give me a nudge please? – cookie Feb 15 '19 at 12:25
  • Hang on, I think Gordon's answer may solve it https://stackoverflow.com/questions/2822774/php-is-htmlentities-sufficient-for-creating-xml-safe-values – cookie Feb 15 '19 at 13:37

1 Answer


The second argument of DOMDocument::createElement() is broken: it escapes only some characters, and it is not part of the W3C DOM standard. In DOM, text content is a node of its own, so you can create a text node and append it to the element node. The same works for other node types such as CDATA sections or comments. DOMNode::appendChild() returns the appended node, so you can nest and chain the calls.

Alternatively, you can set the DOMElement::$textContent property. This replaces all descendant nodes with a single text node. Do not use DOMElement::$nodeValue; it has the same problems as the second argument of createElement().

$document = new DOMDocument();
$document->formatOutput = true;
$root = $document->appendChild($document->createElement('foo'));
$root
   ->appendChild($document->createElement('one'))
   ->appendChild($document->createTextNode('"foo" & <bar>'));
$root
   ->appendChild($document->createElement('one'))
   ->textContent = '"foo" & <bar>';
$root
   ->appendChild($document->createElement('two'))
   ->appendChild($document->createCDATASection('"foo" & <bar>'));
$root
   ->appendChild($document->createElement('three'))
   ->appendChild($document->createComment('"foo" & <bar>'));

echo $document->saveXML();

Output:

<?xml version="1.0"?>
<foo>
  <one>"foo" &amp; &lt;bar&gt;</one>
  <one>"foo" &amp; &lt;bar&gt;</one>
  <two><![CDATA["foo" & <bar>]]></two>
  <three>
    <!--"foo" & <bar>-->
  </three>
</foo>

This will escape special characters (like & and <) as needed. Quotes do not need to be escaped in text content, so they won't be. Other special characters depend on the encoding.

$document = new DOMDocument("1.0", "UTF-8");
$document
   ->appendChild($document->createElement('foo'))
   ->appendChild($document->createTextNode('äöü'));
echo $document->saveXML();

$document = new DOMDocument("1.0", "ASCII");
$document
   ->appendChild($document->createElement('foo'))
   ->appendChild($document->createTextNode('äöü'));
echo $document->saveXML();

Output:

<?xml version="1.0" encoding="UTF-8"?> 
<foo>äöü</foo> 
<?xml version="1.0" encoding="ASCII"?> 
<foo>&#228;&#246;&#252;</foo>
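
Applied to the code in the question, the text node approach could look roughly like the sketch below. The helper name appendTextElement() and the element names are hypothetical and not part of the original post, and the sketch assumes $parent is an element that already belongs to a document. It also sets an attribute to show the one place where the serializer does encode a double quote.

```php
<?php
// Hypothetical helper (not from the original post): creates a child
// element and lets DOM escape the text instead of a string function.
function appendTextElement(DOMNode $parent, string $name, string $value): DOMElement
{
    // Assumes $parent is already attached to a document.
    $document = $parent->ownerDocument;
    $element = $parent->appendChild($document->createElement($name));
    // Decode entities that may already be in the input, so that
    // "&amp;" is stored as "&" and escaped exactly once on output.
    $text = html_entity_decode($value, ENT_QUOTES, 'UTF-8');
    $element->appendChild($document->createTextNode($text));
    return $element;
}

$document = new DOMDocument('1.0', 'UTF-8');
$root = $document->appendChild($document->createElement('Start_Of_XML'));

$item = appendTextElement($root, 'item', '" &amp; \' < > £');
// Inside a double-quoted attribute value the quote has to be escaped.
$item->setAttribute('title', '"quoted"');

$xml = $document->saveXML();
echo $xml;
```

In the element content the quotes stay as they are, while the attribute value is written as title="&quot;quoted&quot;". If an application really requires &quot; or &apos; inside text content, that is a post-processing step outside of what the DOM serializer does.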
ThW
  • When you say: "The second argument of DOMDocument::createElement() is broken - it only escapes partly and it is not part of the W3C DOM standard." Are you referring to the naive or faulty use of `htmlentities()` to encode special chars? – cookie Feb 16 '19 at 09:29
  • He's referring to using `DOMDocument::createElement()` to create the tag and populate the value at the same time. It's a [non-standard feature](https://developer.mozilla.org/en-US/docs/Web/API/Document/createElement) and it isn't implemented correctly in PHP. – Álvaro González Feb 16 '19 at 10:24
  • It will escape `<` and `>`, so it treats it as text, but it still expects you to encode `&` yourself or will expect an entity reference. That is the behavior for an XML fragment. `htmlentities()` and `htmlspecialchars()` are string functions and not related to DOM. You should not need them if you use the XML APIs. – ThW Feb 16 '19 at 10:26
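
The difference described in the last comment can be observed with a small sketch. The element names viaArgument and viaTextNode are made up for illustration. The same string is passed once as the second argument of createElement() and once as a text node.

```php
<?php
$document = new DOMDocument('1.0', 'UTF-8');
$document->formatOutput = true;
$root = $document->appendChild($document->createElement('root'));

// Second argument: the value is handled like an XML fragment, so an
// already encoded "&amp;" is read as an entity reference.
$root->appendChild($document->createElement('viaArgument', 'a &amp; <b>'));

// Text node: the same string is taken literally and escaped in full,
// so the ampersand of "&amp;" is escaped again on output.
$root
    ->appendChild($document->createElement('viaTextNode'))
    ->appendChild($document->createTextNode('a &amp; <b>'));

$xml = $document->saveXML();
echo $xml;
```

The text node keeps the input exactly as given, which is why pre-encoding the value with htmlspecialchars() or htmlentities() leads to double escaping there. A raw, unencoded & in the second argument of createElement() is expected to trigger an "unterminated entity reference" warning, so it is left out of the sketch.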