There are two approaches to getting the outer HTML of a DOMDocument node suggested here: How to return outer html of DOMDocument?

I'm interested in why they seem to treat HTML entities differently.

EXAMPLE:

function outerHTML($node) {
    $doc = new DOMDocument();
    $doc->appendChild($doc->importNode($node, true));
    return $doc->saveHTML();
}

$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
$el = $dom->getElementsByTagName('p')->item(0);
echo $el->ownerDocument->saveHTML($el) . PHP_EOL;
echo outerHTML($el) . PHP_EOL;

OUTPUT:

<p>ACME’s 27” Monitor is $200.</p>
<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>

Both methods use saveHTML(), but the function preserves HTML entities in the final output, while calling saveHTML() directly with a node argument does not. Can anyone explain why, preferably with an authoritative reference?

miken32
  • See also: https://stackoverflow.com/questions/51660286/why-does-domdocumentsavehtmls-behavior-differ-in-encoding-utf-8-as-entities – miken32 Jan 27 '20 at 22:36

1 Answer

What this comes down to is even simpler than your test case above:

<?php
$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>';
$dom = new DOMDocument();
@$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

echo $dom->saveHTML($dom->documentElement) . PHP_EOL;
echo $dom->saveHTML() . PHP_EOL;

So the question becomes: why does DOMDocument::saveHTML() behave differently when saving an entire document instead of just a specific node?

Taking a peek at the PHP source, we find a check for whether it's working with a single node or a whole document. For the former, the htmlNodeDumpFormatOutput function is called with the encoding explicitly set to null. For the latter, the htmlDocDumpMemoryFormat function is used, and no encoding is passed to it as an argument.

Both of these functions are from the libxml2 library. Looking at that source, we can see that htmlDocDumpMemoryFormat tries to detect the document encoding, and explicitly sets it to ASCII/HTML if it can't find one.

Both functions end up calling htmlNodeListDumpOutput, passing it the encoding that's been determined; either null – which results in no encoding – or ASCII/HTML – which encodes using HTML entities.

My guess is that, for a document fragment or single node, encoding is considered less important than for a full document.
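Following the explanation above, one way to get the same output with or without a context node is to always go through the node branch. This is only a minimal sketch: saveHtmlViaNodes is a hypothetical helper name, not part of DOMDocument, and it assumes that serializing each top-level child on its own is acceptable for your document (I only tried the idea on a fragment loaded with LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD, so there is no doctype involved).

```php
<?php
// Hypothetical helper (not a DOMDocument method): always serialize through the
// node-context branch of saveHTML(), so both call styles take the same code path.
function saveHtmlViaNodes(DOMDocument $dom, ?DOMNode $node = null): string
{
    if ($node !== null) {
        // A context node was given: use the node branch directly.
        return $dom->saveHTML($node);
    }

    // No context node: serialize every top-level child via the node branch
    // instead of calling saveHTML() with no argument.
    $out = '';
    foreach ($dom->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>';
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$p = $dom->getElementsByTagName('p')->item(0);

$whole  = saveHtmlViaNodes($dom);      // whole document, no context node
$single = saveHtmlViaNodes($dom, $p);  // single node

echo $whole . PHP_EOL;
echo $single . PHP_EOL;
```

Because both calls end up in the same branch, the two lines printed should match each other, whatever your libxml2 version does with the entities themselves.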

miken32
  • *"My guess is that, for a document fragment or single node, encoding is considered less important than for a full document."* - hahaha.. well, it's valid speculation, but I'd beg to differ if that's truly the case. At any rate, thank you for the research. Do you think there is any way to get `saveHtml()` to produce the same output regardless of whether a context node is passed? – But those new buttons though.. Jan 27 '20 at 23:59
  • 1
    I've come across a bunch of questions asking how to output with/without entities. Best bet is something like `mb_convert_encoding($dom->saveHtml(), 'UTF-8', 'HTML-ENTITIES')` to get rid of entities. – miken32 Jan 28 '20 at 00:10