I want to perform certain manipulations on an XML document with PHP, using the DOM part of its standard library. As others have already discovered, one then has to deal with decoded entities. To illustrate what bothers me, here is a quick example.

Suppose we have the following code

$doc = new DOMDocument();
$doc->loadXML(<XML data>);

$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);

foreach($node_list as $node) {
    //do something
}

If the code in the loop is something like

$attr = "<some string>";
$val = $node->getAttribute($attr);
//do something with $val
$node->setAttribute($attr, $val);

it works fine. But if it's more like

$text = $node->textContent;
//do something with $text
$node->nodeValue = $text;

and $text contains a decoded &, that & doesn't get encoded again on assignment, even if one does nothing with $text at all.

At the moment, I apply htmlspecialchars on $text before I set $node->nodeValue to it. Now I want to know

  1. if that is sufficient,
  2. if not, what would suffice,
  3. and if there are more elegant solutions for this, as in the case of attribute manipulation.

The XML documents I have to deal with are mostly feeds, so a solution should be pretty general.


EDIT

It turned out that my original question had the wrong scope, sorry for that. Here I provide an example where the described behaviour actually happens.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
$doc->loadXML($output);

$xpath = new DOMXPath($doc);
$node_list = $xpath->query('//item/link');

foreach($node_list as $node) {
        $node->nodeValue = $node->textContent;
}
echo $doc->saveXML();

If I execute this code on the CLI with

php beeb.php |egrep 'link|Warning'

I get results like

<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss</link>

which should be

<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa</link>

(and is, if the loop is omitted), along with corresponding warnings such as

Warning: main(): unterminated entity reference ns_source=PublicRSS20-sa in /private/tmp/beeb.php on line 15

When I apply htmlspecialchars to $node->textContent, it works fine, but I feel very uncomfortable doing that.

Percival Ulysses
  • Pretty good question. Just by reading it I found the solution to my similar problem. Thank you. – zVictor Jan 08 '14 at 09:48

2 Answers

Your question is basically whether DOMText::nodeValue should be set to an XML-encoded string or to a verbatim string.

So let's just try that out: set it to & and then to &amp; and see what happens:

$doc = new DOMDocument();
$doc->loadXML('<root>*</root>');

$text = $doc->documentElement->childNodes->item(0);

echo "Before Edit: ", $doc->saveXML($text), "\n";

$text->nodeValue = "&";

echo "After Edit 1: ", $doc->saveXML($text), "\n";

$text->nodeValue = "&amp;";

echo "After Edit 2: ", $doc->saveXML($text), "\n";

The output is then as follows (PHP 5.0.0 - 5.5.0):

Before Edit: *
After Edit 1: &amp;
After Edit 2: &amp;amp;

This shows that setting the nodeValue of a DOMText-node expects a UTF-8 encoded string and the DOM library encodes the XML reserved characters automatically.

So you should not apply htmlspecialchars() to any text you add this way. That would result in double encoding.
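A minimal sketch of that double encoding, assuming a made-up one-element document and a made-up sample string (neither is from the question):

```php
<?php
// Sketch: assign a raw string and an htmlspecialchars()-escaped string
// to the same DOMText node and compare the serialized results.
// The document and the sample string are illustrative assumptions.
$doc = new DOMDocument();
$doc->loadXML('<root>*</root>');

$text = $doc->documentElement->firstChild; // the DOMText node holding "*"

$raw = 'a=1&b=2';

// Raw string: the library escapes the ampersand once on serialization.
$text->nodeValue = $raw;
$plain = $doc->saveXML($text);

// Pre-escaped string: the "&" of "&amp;" is escaped a second time.
$text->nodeValue = htmlspecialchars($raw);
$double = $doc->saveXML($text);

echo $plain, "\n";  // a=1&amp;b=2
echo $double, "\n"; // a=1&amp;amp;b=2
```

Running this on the command line rather than in a browser keeps the entities visible exactly as they were serialized.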

Since you write that you experience the opposite, I suggest you execute an isolated PHP example on the command line or within your IDE, so that you can see the exact output. Otherwise your browser may render the output as HTML and lead you to think that the reserved XML characters have not been encoded.


As you have pointed out, you're not editing a DOMText but a DOMElement node. That works a bit differently: here the & character needs to be passed as the entity &amp; instead of verbatim, but only this character.

So this needs a little bit more work:

  1. Read out the text-content and turn it into a DOMText node. Everything will be perfectly encoded.
  2. Remove the node-value of the element node so it's empty.
  3. Append the DOMText node from the first step as a child.

And done. Here is your inner foreach, modified to show this:

foreach($node_list as $node) {
    $text = $doc->createTextNode($node->textContent);
    $node->nodeValue = "";
    $node->appendChild($text);
}
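The same three steps as a self-contained sketch that runs without network access. The inline feed fragment, the example.com URL, and the appended `&c=3` parameter are assumptions made up for illustration; they stand in for the live feed and for whatever modification you actually want to apply:

```php
<?php
// Sketch: rewrite element text via a DOMText node so that "&" is
// encoded by the DOM library instead of by hand.
// The feed fragment below is made up; it is not the BBC feed.
$xml = <<<'XML'
<rss><channel><item>
<link>http://example.com/news?a=1&amp;b=2</link>
</item></channel></rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

foreach ($xpath->query('//item/link') as $node) {
    // textContent is the decoded string, i.e. it contains a bare "&".
    $value = $node->textContent . '&c=3';

    // Step 1: wrap the (modified) text in a DOMText node.
    $text = $doc->createTextNode($value);
    // Step 2: empty the element.
    $node->nodeValue = "";
    // Step 3: append the text node as the only child.
    $node->appendChild($text);
}

$result = $doc->saveXML($xpath->query('//item/link')->item(0));
echo $result, "\n";
// <link>http://example.com/news?a=1&amp;b=2&amp;c=3</link>
```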

For your concrete example, though, I must admit I don't understand why you do that: the loop does not change the value, so it wouldn't be needed.

Tip: In PHP DOMDocument can open this feed directly, you don't need curl here:

$doc = new DOMDocument();
$doc->load("http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
hakre
  • Thank you for your answer. I tested your code and it works fine, however it still doesn't work in my own example. I will modify my question accordingly. – Percival Ulysses Jun 27 '13 at 13:26
  • Well, the difference is that in my answer I set the value of a `DOMText`, but in your case you are setting the value of a `DOMElement`. That was indeed not clear to me from your original question, but I can see it now. When setting `DOMElement::$nodeValue`, the `&` sign needs to be XML-encoded; `<` and `>` will automatically be turned into `&lt;` and `&gt;` respectively. So indeed using `htmlspecialchars()` on these would be wrong; I will edit the answer to show you what you can do instead. – hakre Jun 27 '13 at 16:16
  • Thank you, that's it! I knew I was missing something since I'm quite new to DOM (and PHP, for that matter). What I will do is something like `$text = $doc->createTextNode(f($node->textContent));` where `f` modifies the text according to my wishes, I chose `f = id` only for illustration purposes. – Percival Ulysses Jun 27 '13 at 16:40
  • Yes, I thought that might be the case, but I was not sure and I didn't want to let it go unnoticed. Sometimes when we're coding we look in totally the wrong places, so some clear and open words are often helpful to keep things aligned. See as well: [Generating XML document in PHP (escape characters)](http://stackoverflow.com/questions/3957360/generating-xml-document-in-php-escape-characters) – hakre Jun 27 '13 at 18:03

As hakre explained, the problem is that in PHP's DOM library, the behaviour of setting nodeValue w.r.t. entities depends on the class of the node, in particular DOMText and DOMElement differ in this regard. To illustrate this, an example:

$doc = new DOMDocument();
$doc->formatOutput = True;
$doc->loadXML('<root/>');

$s = 'text &amp;&lt;<"\'&text;&text';

$root = $doc->documentElement;

$node = $doc->createElement('tag1', $s); #line 10
$root->appendChild($node);

$node = $doc->createElement('tag2');
$text = $doc->createTextNode($s);
$node->appendChild($text);
$root->appendChild($node);

$node = $doc->createElement('tag3');
$text = $doc->createCDATASection($s);
$node->appendChild($text);
$root->appendChild($node);

echo $doc->saveXML();

outputs

Warning: DOMDocument::createElement(): unterminated entity reference            text in /tmp/DOMtest.php on line 10
<?xml version="1.0"?>
<root>
  <tag1>text &amp;&lt;&lt;"'&text;</tag1>
  <tag2>text &amp;amp;&amp;lt;&lt;"'&amp;text;&amp;text</tag2>
  <tag3><![CDATA[text &amp;&lt;<"'&text;&text]]></tag3>
</root>

In this particular case, it is appropriate to alter the nodeValue of DOMText nodes. Combining hakre's two answers one gets a quite elegant solution.

$doc = new DOMDocument();
$doc->loadXML(<XML data>);

$xpath     = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);

$visitTextNode = function (DOMText $node) {
    $text = $node->textContent;
    /*
        do something with $text
    */
    $node->nodeValue = $text;
};

foreach ($node_list as $node) {
    if ($node->nodeType == XML_TEXT_NODE) {
        $visitTextNode($node);
    } else {
        foreach ($node->childNodes as $child) {
            if ($child->nodeType == XML_TEXT_NODE) {
                $visitTextNode($child);
            }
        }
    }
}
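The traversal above could also be wrapped in a function that takes the visitor as a parameter. The following is only a sketch: `visitTextNodes` is a hypothetical helper name (not part of PHP's DOM API), and the sample document and the `strtoupper` visitor are made up for illustration.

```php
<?php
// Hypothetical helper: applies $visitor to every DOMText node that is
// either in $nodes itself or a direct child of one of its nodes.
// CDATA sections have a different nodeType and are skipped.
function visitTextNodes(iterable $nodes, callable $visitor): void
{
    foreach ($nodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            $visitor($node);
            continue;
        }
        foreach ($node->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $visitor($child);
            }
        }
    }
}

// Made-up sample document containing reserved characters.
$doc = new DOMDocument();
$doc->loadXML('<root><a>x &amp; y</a><b>p &lt; q</b></root>');
$xpath = new DOMXPath($doc);

// The visitor works on decoded text; DOMText re-encodes on output.
visitTextNodes($xpath->query('//a | //b'), function (DOMText $node) {
    $node->nodeValue = strtoupper($node->nodeValue);
});

$out = $doc->saveXML($doc->documentElement);
echo $out, "\n";
// <root><a>X &amp; Y</a><b>P &lt; Q</b></root>
```

Because the visitor only ever touches `DOMText` nodes, it can treat the text as a plain decoded string and leave the encoding to the library.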
Percival Ulysses
  • @hakre Thank you again, that looks neater, and more flexible; i.e. I guess that one could put the outer loop into a function, and have the `$visitTextNode` function as an argument to this new function? Would be a nice iterator over the `DOMText` nodes then. – Percival Ulysses Aug 16 '13 at 21:51
  • Well, that anonymous function (yes, you can easily pass it as a parameter in PHP) is the visitor; the iterator (traversal is probably more precise in your case) is the code where you traverse. Iteration is most often part of a traversal, and a visitor is a good companion to a traversal. So there is no big magic in my edit, it just shows how you can reduce the code *and* gain more flexibility. The traversal can be improved as well, but I didn't :) – hakre Aug 16 '13 at 22:24
  • This kind of behavior of `nodeValue` has indeed been mistakenly reported as a bug: https://bugs.php.net/bug.php?id=31613 – nggit Apr 28 '17 at 03:26