I have this xml (had to cut/paste via HTML).

<tr>
    <td>http://www.example.co.uk/the-view-from-22/feed/</td>
    <td>Example Blogs » The View from 22 » Example Blogs</td>
    <td>http://blogs.example.co.uk/</td>
    <td><![CDATA[Listen: The Example&rsquo;s verdict on the debate]]></td>
    <td>http://blogs.example.co.uk/coffeehouse/2015/04/podcast-special-the-debate/</td>
</tr>

It is being loaded into a DOMDocument:

   $dom = new DOMDocument();
   $dom->preserveWhiteSpace = false;
   $dom->formatOutput = true;
   $dom->loadXML( $xml->asXML() );
   return $dom->saveXML();

But this throws an error about the &rsquo; entity not being defined.

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Entity 'rsquo' not defined in Entity,...

As it is in a CDATA section I expected the DOMDocument to treat it as text and ignore it... but it doesn't... is there a way around this?

The Data is being pulled directly out of a mysql database in a view, so there isn't much scope for 'fixing it up' first - I added the CDATA in the select clause for the view, that was my attempt at a fix!

Edit: Traced it back following the suggestions below (cheers!)

The data is being loaded using $xml->addChild( $key, $value ), but the $value is in the form <![CDATA[...]]>, so it is being encoded as you surmised.

So am just trying this...

How to write CDATA using SimpleXmlElement?

And it works. I am now loading up the original doc with:

    if (strpos(strtoupper($value), '<![CDATA[') === 0 && strpos(strrev($value), '>]]') === 0) {
        $child = $xml->addChild( $key );
        $node  = dom_import_simplexml($child);
        $no    = $node->ownerDocument;
        $node->appendChild($no->createCDATASection(substr($value, 9, strlen($value) - 12)));
    } else {
        // simple key/value child pair
        $xml->addChild( $key, $value );
    }
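
The same idea can be sketched as a small reusable helper. The function name `addChildWithCdata` is hypothetical (it is not part of SimpleXML), and the sketch assumes the caller passes the raw text without the `<![CDATA[` and `]]>` markers and that the text does not itself contain `]]>`.

```php
<?php
// Hypothetical helper (not part of SimpleXML): adds $name as a child of
// $parent and stores $text inside a real CDATA section.
function addChildWithCdata(SimpleXMLElement $parent, string $name, string $text): SimpleXMLElement
{
    $child = $parent->addChild($name);
    // Switch to the DOM view of the same node to create the CDATA section.
    $node = dom_import_simplexml($child);
    $node->appendChild($node->ownerDocument->createCDATASection($text));
    return $child;
}

$xml = new SimpleXMLElement('<tr/>');
addChildWithCdata($xml, 'td', 'Listen: The Example&rsquo;s verdict on the debate');

$out = $xml->asXML();
echo $out;
```

Because the CDATA node is created through DOM, the text is never passed through addChild() as a value, so the markers are not escaped.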
pperrin
  • 1
    Please provide the exact error message as well. And why are you using `$xml->asXML()`? If that is dropping the CDATA sequence it might create invalid XML (yes that is possible; I suspect you're having a SimpleXMLElement here). – hakre Apr 20 '15 at 19:56
  • Hi, added the actual error message! The end result is blank... – pperrin Apr 20 '15 at 20:11
  • Please also add the output of `$xml->asXML()` to the question. I suspect it's different from the one you've posted (that one is probably from your database directly). Could it be? – hakre Apr 20 '15 at 20:31
  • 1
    Also unable to reproduce: http://3v4l.org/K6LkS – hakre Apr 20 '15 at 20:44
  • You need to add the verbatim XML as well to your question, not only the one you copied as your browser displays it. Check "view source" in your browser and locate the "XML". – hakre Apr 21 '15 at 08:11
  • Hi have traced it back - double encoding as was surmised. If you update your answer from your comments I'll accept it - and I'll add the final solution including getting the CDATA in correctly to the original question for the sake of completeness. Cheers! – pperrin Apr 21 '15 at 09:31

2 Answers

0

You could try to replace it, if it's only one &rsquo; and not a mass of special chars.

 $dom = new DOMDocument();
 $dom->preserveWhiteSpace = false;
 $dom->formatOutput = true;
 $xml = $xml->asXML();

 $xml = str_replace('&rsquo;', '&#8217;', $xml);

 $dom->loadXML($xml);
 return $dom->saveXML();

The real question is how the &rsquo; got into your database. Fix it before insertion; then you can pull well-formed XML. https://stackoverflow.com/a/3142636/1163786

Or make rsquo a valid entity:

<!DOCTYPE ROOT_XML_ELEMENT [ <!ENTITY rsquo "&#8217;"> ]>
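
A minimal sketch of how such a declaration could be used with DOMDocument. The sample document and the root element name `tr` in the DOCTYPE are assumptions for illustration; the name has to match the real root element. `LIBXML_NOENT` asks the parser to substitute entities, and since it also applies to external entities it should only be used on trusted input.

```php
<?php
// Sample document with an internal DTD subset that declares "rsquo".
$source = <<<'XML'
<!DOCTYPE tr [ <!ENTITY rsquo "&#8217;"> ]>
<tr>
    <td>Listen: The Example&rsquo;s verdict on the debate</td>
</tr>
XML;

$doc = new DOMDocument();
// LIBXML_NOENT substitutes declared entities with their replacement text.
$loaded = $doc->loadXML($source, LIBXML_NOENT);

// After substitution the text contains the UTF-8 character U+2019.
$text = $doc->getElementsByTagName('td')->item(0)->textContent;
echo $text, "\n";
```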

If your content is UTF-8, simply replace it with the literal character: ’


(I think) the original issue is this one:

Warning: Entity 'rsquo' not defined in Entity, line: ...

<?php

$xml = <<<XML
<tr>
    <td>Listen: The Example&rsquo;s verdict on the debate</td>
</tr>
XML;

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML($xml);
echo $doc->saveXML();

The error pops up because 'rsquo' is not one of XML's predefined entities and is not declared anywhere. pperrin then tried to address it by adding a "CDATA fix". That's how I understand the question.

You don't need CDATA - if you

  • define the entity at the root or
  • add it to the DTD to make it valid or
  • replace it manually (see above)
  • or simply fix it before it goes to the database.
Jens A. Koch
  • Hi, the entity is valid - it's some HTML I am passing around via the database and via XML documents (or would be if it was working!!) – pperrin Apr 20 '15 at 20:14
  • I doubt that the entity is valid. Your XML file contains an entity reference to the entity "&rsquo;", but that entity isn't declared anywhere (for example in a DTD). And so you get: error undefined entity. – Jens A. Koch Apr 20 '15 at 20:16
  • @JensA.Koch: Within CDATA there is no such entity, just character data, so normally there is nothing for DOMDocument to choke on. I suspect the problem is with `$xml->asXML();` already. Using the XML verbatim would then be the answer. But OP has not yet shared why there is a SimpleXMLElement in the first place. – hakre Apr 20 '15 at 20:35
  • I understand this a bit differently, because not all entities are known and valid within a CDATA section. http://www.w3.org/TR/REC-xml/#NT-Char – Jens A. Koch Apr 20 '15 at 20:47
  • Let's go through this: "&": Check, OK! - "#": Check, OK! - "8": Check, OK! - "2": Check, OK! - "1": Check, OK! - "7": Check, OK! - ";": Check, OK! -- result: all those seven characters are valid characters in the document encoding (US-ASCII assumed). Better compare here https://en.wikipedia.org/wiki/CDATA#Issues_with_encoding --- most likely the user has a double encoding. – hakre Apr 20 '15 at 21:10
  • Sorry, I mixed up the end of the sentence. It should read "not all entities are known and valid within an XML document". "&rsquo;" is not valid. If you replace it with "&#8217;" you get the same char, but now it's valid XML. – Jens A. Koch Apr 20 '15 at 21:50
  • Jens, yes, that is generally true for XML documents, but OP wrote about CDATA, and within CDATA the entity notation is not available; instead those characters are taken verbatim. But OP seems to have a double encoding that causes the issue, so decoding first and then parsing the XML might work. – hakre Apr 21 '15 at 06:13
0

As I have demonstrated with my example code I was not able to reproduce your problem. Therefore I came to the conclusion that you must have a double-encoding and the double-encoded data is where the XML parser chokes and rightfully gives you the warnings. It's just that due to the double encoding this was not immediately visible.

Decode the data once so that it is properly XML encoded. Then DOMDocument can easily load it.
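
The difference can be checked with a sketch like the following. The two sample strings are assumptions made up for illustration: the first contains a real CDATA section, the second contains the same text after the CDATA markers themselves have been escaped, which is the kind of double encoding suspected here.

```php
<?php
// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// A real CDATA section: "&rsquo;" is plain character data here.
$good = '<tr><td><![CDATA[The Example&rsquo;s verdict]]></td></tr>';
// Escaped CDATA markers: the section no longer exists, so "&rsquo;"
// is parsed as a reference to an undeclared entity.
$bad = '<tr><td>&lt;![CDATA[The Example&rsquo;s verdict]]&gt;</td></tr>';

$doc = new DOMDocument();
$goodLoaded = $doc->loadXML($good);
$goodText = $goodLoaded ? $doc->documentElement->textContent : null;

$doc2 = new DOMDocument();
$badLoaded = $doc2->loadXML($bad);
$errors = libxml_get_errors();
libxml_clear_errors();

var_dump($goodLoaded, $goodText, $badLoaded);
foreach ($errors as $error) {
    echo trim($error->message), "\n";
}
```

Only the second document should fail to load, which matches the observation that the CDATA section itself is not what produces the error.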


Old Answer (can still be useful for users coming here via search engines):

I suspect your problem is with $xml->asXML() as the CDATA section does not produce that error.

There is a better way to convert into a DOMDocument first:

$dom = dom_import_simplexml($xml)->ownerDocument;

This should also preserve the CDATA section (not 100% sure). For the formatting you may need to reload the document, but perhaps not. Try:

$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$result = $dom->saveXML();

If the result is not yet the expected pretty-print format you're looking for, you can reload the document from dom:

$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadXML($dom->saveXML());
$result = $dom->saveXML();

I hope that, as this is DOMDocument, there is no problem with the previously CDATA-encoded characters that resemble an entity.

The conversion function dom_import_simplexml() is in the manual, and as SimpleXML and DOM share a common basis behind their interfaces, using it should be the preferred way to switch between DOM and SimpleXML or vice versa.
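
A minimal sketch of that switch, assuming a small sample document made up for illustration. dom_import_simplexml() returns the DOMElement for the SimpleXML node, and its `ownerDocument` property gives the DOMDocument, so no serialising and re-parsing is involved.

```php
<?php
// Sample SimpleXMLElement containing a CDATA section.
$sxe = new SimpleXMLElement('<tr><td><![CDATA[The Example&rsquo;s verdict]]></td></tr>');

// DOMElement for the root node, then the DOMDocument that owns it.
$dom = dom_import_simplexml($sxe)->ownerDocument;
$dom->formatOutput = true;

$result = $dom->saveXML();
echo $result;
```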

hakre
  • As per your earlier comment - the asXML() is encoding the CDATA tags(!) – pperrin Apr 20 '15 at 21:42
  • So it shows &lt;![CDATA[... ! I changed to your suggested code but now get Fatal error: Call to undefined method stdClass::saveXML() – pperrin Apr 20 '15 at 21:55
  • 1
    @pperrin: Whatever you stored into your database is not XML. You have encoded it wrong before storing it into the database. You have to fix the data first. Most likely a double encoding. The code in your question shows too little to be sure of that and more importantly to actually give another suggestion than just fix your data. – hakre Apr 21 '15 at 06:15
  • The database is holding an arbitrary fragment of text (happening to be HTML) - I want to transfer this fragment to another process via an XML document. So the field is read, wrapped in a CDATA and put in an XML document; XML has no business looking inside a CDATA (other than for the end of the CDATA). From your suggestion, I see the round trip into XML then out via 'saveXML' is messing up the CDATA - I will step through and see where I get - thanks. – pperrin Apr 21 '15 at 08:14