When you "round-trip" some HTML to a DOMDocument object and then back to HTML, any &nbsp; entities are converted to Unicode non-breaking spaces. That in itself isn't a problem, because both are valid HTML. However, if you send that output through another round trip, every non-breaking space character gets garbled with an extra character prepended that looks like an Â. More round trips make the output weirder and weirder. Is this a bug in PHP, as the longstanding bug #67727 on php.net suggests? Or am I doing something wrong?
Example:
function roundTrip($html) {
    // Parse the fragment inside a minimal document, then serialize the first node back to HTML.
    $doc = new DOMDocument();
    $doc->loadHTML("<html><body>$html</body></html>");
    $node = $doc->getElementsByTagName('body')[0]->firstChild;
    return $node->ownerDocument->saveHTML($node);
}

$original = '<p>Hello &nbsp; world</p>';
$pass1 = roundTrip($original);
$pass2 = roundTrip($pass1);
$pass3 = roundTrip($pass2);

echo json_encode($original); // "<p>Hello &nbsp; world</p>"
echo json_encode($pass1);    // "<p>Hello \u00a0 world</p>"
echo json_encode($pass2);    // "<p>Hello \u00c2\u00a0 world</p>"
echo json_encode($pass3);    // "<p>Hello \u00c3\u0082\u00c2\u00a0 world</p>"
* An astute observer will notice that I added JSON_UNESCAPED_SLASHES to the example output for readability but left it out of the example code for brevity.
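For comparison, here is a minimal sketch of a variant that declares the input charset up front. It rests on an assumption I have not confirmed against the bug report: that loadHTML() reads input with no declared charset as ISO-8859-1, so the two UTF-8 bytes of U+00A0 (C2 A0) are parsed as the two characters Â and a non-breaking space. That would match the \u00c2\u00a0 seen in the second pass. The name roundTripUtf8 is hypothetical, and the meta tag is only one way to hint the encoding.

```php
<?php
// Hypothetical variant of roundTrip() that tells the parser the input is UTF-8.
// Assumption: without this hint, loadHTML() treats the bytes as ISO-8859-1.
function roundTripUtf8(string $html): string {
    $doc = new DOMDocument();
    // Declare the charset in <head> so the HTML parser decodes the body as UTF-8.
    $doc->loadHTML(
        '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $html . '</body></html>'
    );
    // Serialize only the first child of <body>, as in the original example.
    $node = $doc->getElementsByTagName('body')->item(0)->firstChild;
    return $doc->saveHTML($node);
}

$pass1 = roundTripUtf8('<p>Hello &nbsp; world</p>');
$pass2 = roundTripUtf8($pass1);

echo json_encode($pass1), "\n";
echo json_encode($pass2), "\n";
```

If the assumption holds, both lines should show the same string with a single \u00a0, meaning the round trip stops changing after the first pass.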