I've been at this for half a day, so now it's time to ask for help.

What I'd like is for DOMDocument to leave existing entities and utf-8 characters alone. I'm now thinking this is not possible using only DOMDocument.

$html =
'<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <p>&#39; &quot; & &lt; © 庭</p>
    </body>
</html>';

Then I run:

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

echo $dom->saveHTML();

And get this output, with several entities changed:

input: &#39; &quot; & &lt; © 庭
output: ' " &amp; &lt; &copy; &#24237;

Why is DOMDocument converting &#39; and &quot; to actual quote marks? The only thing it didn't touch was &lt;.

Pretty sure the copyright symbol is being converted because DOMDocument doesn't think the input HTML is UTF-8, but I'm utterly confused about why it's converting the quote entities back to literal quote characters.

I thought the mb_convert_encoding trick would fix the utf-8 issue, but it hasn't.

Neither has the $dom->loadHTML('<?xml encoding="utf-8" ?>'.$html); trick.

Jeff
  • `DOMDocument` parses everything into a canonical internal form of the DOM. It doesn't remember the format you used in the input HTML. So there's no way to get it to leave things alone. – Barmar Apr 19 '23 at 22:36
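The point in the comment above can be checked directly. This is a minimal sketch, not part of the original thread: the `$fragment` and `$text` variable names are made up for illustration, and the sample is deliberately ASCII-only so the result does not depend on how the charset is detected.

```php
<?php
// ASCII-only sample (hypothetical input) using the same entity spellings as the question.
$fragment = '<p>&#39; &quot; &amp; &lt;</p>';

$dom = new DOMDocument();
$dom->loadHTML($fragment, LIBXML_NOERROR);

// The text node stores the decoded characters. The original spellings
// (&#39;, &quot;, ...) are gone once parsing has finished.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

var_dump($text === "' \" & <"); // bool(true)
```

Since the tree only holds the characters themselves, the serializer has to decide on its own which of them to write back as entities when saving.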

2 Answers

I tested about a dozen HTML parsers written in PHP, and the only one that worked as expected was HTML5DOMDocument, recommended in this Stack Overflow answer.

require 'vendor/autoload.php';

$dom = new IvoPetkov\HTML5DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

echo $dom->saveHTML();

Result:

input: &#39; &quot; &lt; © 庭 &nbsp; &
output: &#39; &quot; &lt; © 庭 &nbsp; &amp;
Jeff

You need to pass a specific node to the saveHTML() method. That makes it take a minimal approach to encoding entities: it will still encode the ones that are necessary, but it won't try to encode every character it can. I don't think there's a way to prevent all entity encoding from happening.

$html = $dom->saveHTML($dom);
// ' " &amp; &lt; © 庭
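For comparison, here is a slightly fuller sketch of the same idea using the HTML from the question. It is an illustration under assumptions rather than a definitive recipe: it passes `$dom->documentElement` and the `<p>` element instead of `$dom` itself, and the `$full`, `$root`, and `$para` variable names are made up for this example.

```php
<?php
$html = '<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <p>&#39; &quot; & &lt; © 庭</p>
    </body>
</html>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

// No node argument: the whole document is serialized (the question's case).
$full = $dom->saveHTML();

// Node argument: serialize from the root element down
// (assumption: documentElement stands in for the $dom argument used above).
$root = $dom->saveHTML($dom->documentElement);

// Or serialize only the paragraph of interest.
$p = $dom->getElementsByTagName('p')->item(0);
$para = $dom->saveHTML($p);

echo $para, PHP_EOL;
```

Note that serializing from `documentElement` leaves out the doctype, because the doctype is a sibling of the root element rather than a child, so it would need to be added back if the full page is wanted.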
Jim
  • This helps, thanks. I'll see what I can come up with. [Here's a nice little loop](https://stackoverflow.com/a/16931835/142233) I'll modify and try. – Jeff Apr 19 '23 at 22:45