The answer by ThW is thoughtful overall and the way to go. It explains well how the interface of XMLWriter in PHP is meant to be used.
Credit goes to him as well for a large part of the work behind this more detailed answer, since we discussed the question in chat yesterday.
There are some constraints on CDATA in XML, however, and they also apply to the two outlined ways of using XMLWriter for CDATA:
The string ']]>' cannot be placed within a CDATA section, therefore, nested CDATA sections are not allowed (well-formedness constraint).
From: CDATA Section - compare 2.7 CDATA Sections
Normally XMLWriter accepts string data that is not yet encoded for XML. E.g. if you pass some text, it gets written properly encoded (unless you use the aforementioned XMLWriter::writeRaw).
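As a small illustration of that automatic encoding, here is a minimal sketch that writes text containing markup characters into memory instead of to `php://output`. The sample string is made up for the example.

```php
<?php
// Write into an in-memory buffer so the result can be inspected as a string.
$xml = new XMLWriter();
$xml->openMemory();
$xml->startElement('PostContent');
// text() is the "normal" path: special characters are escaped for us.
$xml->text('Foo & Bar <baz>');
$xml->endElement();
$encoded = $xml->outputMemory();

echo $encoded, "\n";
// prints: <PostContent>Foo &amp; Bar &lt;baz&gt;</PostContent>
```

No manual escaping was needed here, which is exactly what does not hold for CDATA.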
But if you start a CDATA section and then write text, or you write CDATA directly, the string passed must neither end nor contain another CDATA section. That means it cannot contain the character sequence "]]>", as this would end the CDATA section prematurely.
So the responsibility for passing valid data to XMLWriter remains with the user of those methods.
Doing so is normally trivial (for single-octet, US-ASCII-based character encodings and for UTF-8). Here is some example code:
/**
* prepare text for CDATA section to prevent invalid or nested CDATA
*
* @param $string
*
* @return string
* @link http://www.w3.org/TR/REC-xml/#sec-cdata-sect
*/
function xmlwriter_prepare_cdata_text($string) {
return str_replace(']]>', ']]]]><![CDATA[>', (string) $string);
}
And a usage example:
$xml = new XMLWriter();
$xml->openURI("php://output");
$xml->startDocument();
$xml->startElement("PostContent");
$xml->writeCDATA(xmlwriter_prepare_cdata_text('<![CDATA[Foo & Bar]]>'));
$xml->endElement();
$xml->endDocument();
Example output:
<?xml version="1.0"?>
<PostContent><![CDATA[<![CDATA[Foo & Bar]]]]><![CDATA[>]]></PostContent>
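To check that splitting the CDATA section does not change the data, the output can be parsed back and compared with the input. The following is a minimal sketch of such a round trip. It repeats the helper so that it runs on its own, and it writes to memory rather than to `php://output`; the variable names are only illustrative.

```php
<?php
// Same helper as in the answer, repeated so this snippet is self-contained.
function xmlwriter_prepare_cdata_text($string) {
    return str_replace(']]>', ']]]]><![CDATA[>', (string) $string);
}

$original = '<![CDATA[Foo & Bar]]>';

// Write the prepared text as CDATA into an in-memory buffer.
$xml = new XMLWriter();
$xml->openMemory();
$xml->startDocument();
$xml->startElement('PostContent');
$xml->writeCdata(xmlwriter_prepare_cdata_text($original));
$xml->endElement();
$xml->endDocument();
$buffer = $xml->outputMemory();

// Parse it back; textContent concatenates the adjacent CDATA sections.
$dom = new DOMDocument();
$dom->loadXML($buffer);
$roundTripped = $dom->documentElement->textContent;

var_dump($roundTripped === $original); // bool(true)
```

The two CDATA sections are adjacent character data, so a parser hands the application the original string again.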
DOMDocument, by the way, already does something very similar under the hood:
$dom = new DOMDocument();
$dom->appendChild(
$dom->createElement('PostContent')
);
$dom->documentElement->appendChild(
$dom->createCdataSection('<![CDATA[Foo & Bar]]>')
);
$dom->save("php://output");
Output:
<?xml version="1.0"?>
<PostContent><![CDATA[<![CDATA[Foo & Bar]]]]><![CDATA[>]]></PostContent>
To understand technically why XMLWriter in PHP behaves this way, you need to know that XMLWriter is based on the libxml2 library. For most of its work, the PHP extension simply passes the calls through to libxml:
PHP's xmlwriter_write_cdata delegates to libxml's xmlTextWriterWriteCDATA, which performs the expected sequence of xmlTextWriterStartCDATA, xmlTextWriterWriteString and xmlTextWriterEndCDATA.
xmlTextWriterWriteString is used in many routines (e.g. writing a PI), but the content string is encoded only in some text-writing cases:
- Name,
- Text and
- Attribute.
For all others, it is passed as-is. This includes CDATA, so the data passed to XMLWriter::writeCData must match the requirements of the XML CData production (because that is what the method writes):
- [20] CData ::= (Char* - (Char* ']]>' Char*))
Technically, this says: any string not containing "]]>".
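To see the as-is pass-through in practice, here is a minimal sketch that deliberately writes an unprepared string containing "]]>" and then tries to parse the result. It assumes the installed XMLWriter/libxml2 performs no check of its own on CDATA content, as described above; the sample string is made up.

```php
<?php
$xml = new XMLWriter();
$xml->openMemory();
$xml->startDocument();
$xml->startElement('PostContent');
// Deliberately NOT prepared: the "]]>" ends the section early,
// leaving " b & c]]>" behind as ordinary (and invalid) element content.
$xml->writeCdata('a ]]> b & c');
$xml->endElement();
$xml->endDocument();
$buffer = $xml->outputMemory();

// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$wellFormed = $dom->loadXML($buffer);

var_dump($wellFormed); // expected: bool(false)
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();
```

The writer produces the document without complaint; the problem only surfaces when something tries to read it.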
This is easy to overlook; I myself suspected a bug yesterday. And I'm not the only one: a related bug report on PHP.net from years ago is https://bugs.php.net/bug.php?id=44619.
See also: What does <![CDATA[]]> in XML mean?