When you "round-trip" some HTML by converting it to a DOMDocument object and then back to HTML, any `&nbsp;` entities are converted to Unicode non-breaking spaces (U+00A0). That in itself isn't a problem, because both are valid HTML. However, if you send that output through a second round trip, every non-breaking space character becomes garbled, with an extra character prepended that looks like an Â. Further round trips make the output weirder and weirder. Is this a bug in PHP, as the long-standing bug #67727 on php.net suggests? Or am I doing something wrong?

Example:

function roundTrip($html) {
    $doc = new DOMDocument();
    $doc->loadHTML("<html><body>$html</body></html>");
    $node = $doc->getElementsByTagName('body')[0]->firstChild;
    return $node->ownerDocument->saveHTML($node);
}

$original = '<p>Hello &nbsp; world</p>';
$pass1 = roundTrip($original);
$pass2 = roundTrip($pass1);
$pass3 = roundTrip($pass2);

echo json_encode($original); // "<p>Hello &nbsp; world</p>"
echo json_encode($pass1);    // "<p>Hello \u00a0 world</p>"
echo json_encode($pass2);    // "<p>Hello \u00c2\u00a0 world</p>"
echo json_encode($pass3);    // "<p>Hello \u00c3\u0082\u00c2\u00a0 world</p>"

* An astute observer will notice that I added `JSON_UNESCAPED_SLASHES` when producing the example output for readability, but left it out of the example code for brevity.

Coleman
  • The problem is that you are not correctly instructing `loadHTML` what the character encoding of your input is supposed to be. https://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly – 04FS Dec 06 '19 at 14:47
  • I can confirm that's the issue ([demo](https://3v4l.org/5sC27)). It's also not safe to feed `saveHTML()` with plain text rather than HTML, but I'd like to think this is just a quick prototype to serve as a proof of concept. – Álvaro González Dec 06 '19 at 14:50
  • Great! @ÁlvaroGonzález can you say more about what's unsafe in this example? Feeding a string into `loadHTML` like this is basically what my app is doing. – Coleman Dec 06 '19 at 15:35
  • Basically, input with literal `<` and `&` characters. – Álvaro González Dec 06 '19 at 15:37
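
The encoding fix suggested in the comments can be sketched roughly as follows. The garbling in the question is consistent with the two UTF-8 bytes of U+00A0 (0xC2 0xA0) being read as two ISO-8859-1 characters, Â and a non-breaking space. This is only a minimal sketch and not necessarily what the linked demo does: it assumes the input is UTF-8 and declares that with a `<meta http-equiv="Content-Type">` tag in the wrapper markup. The function name `roundTripUtf8` is hypothetical.

```php
// Hypothetical variant of roundTrip() that declares the input encoding.
// Assumption: $html is a UTF-8 encoded HTML fragment.
function roundTripUtf8(string $html): string {
    $doc = new DOMDocument();
    // Without a declared charset, the parser treats the bytes as ISO-8859-1,
    // so each multi-byte UTF-8 character is split into several characters.
    // The meta tag tells the parser that the input is UTF-8.
    $doc->loadHTML(
        '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $html . '</body></html>'
    );
    $node = $doc->getElementsByTagName('body')->item(0)->firstChild;
    return $doc->saveHTML($node);
}

$original = '<p>Hello &nbsp; world</p>';
$pass1 = roundTripUtf8($original);
$pass2 = roundTripUtf8($pass1);
$pass3 = roundTripUtf8($pass2);

// With the charset declared, later passes should reproduce the first one.
var_dump($pass1 === $pass2, $pass2 === $pass3);
```

This still assumes the input is markup rather than plain text. Literal `<` and `&` characters in plain text would need escaping first, for example with `htmlspecialchars()`.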
