With this command:
$doc->loadHTML($html);
you're commanding the DOMDocument to load your string $html
$html = '<div id="demo">à la téléchargez mêmes</div>';
with the ISO-8859-1 encoding.
But the string you use there was not typed (or viewed) by you in ISO-8859-1; it is UTF-8 encoded.
So technically speaking, you've typed it wrong there ;)
On the other hand, when your script asks for a value:
$xpath->query("//div[@id='demo']")->item(0)->nodeValue;
that value will be UTF-8 encoded (see the Notes section of the PHP manual's DOMDocument page, which covers the character encoding).
To get a better view of the document, output it directly after the call to loadHTML with echo $doc->saveHTML(); so that you can see what is going on (beautified):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div id="demo">
&Atilde;&nbsp; la t&Atilde;&copy;l&Atilde;&copy;chargez m&Atilde;&ordf;mes
</div>
</body>
</html>
As you can see, you've explicitly told it to insert Atilde, the non-breaking space and all these other characters. The string was taken as HTML 4.0, and as the HTML in your string didn't specify any character encoding, the default encoding (ISO-8859-1) was used.
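To make the Atilde and non-breaking-space effect concrete, here is a minimal sketch that reinterprets UTF-8 bytes as ISO-8859-1 characters by hand. The helper name latin1_bytes_to_utf8 is hypothetical and not part of the DOM extension; it only mimics what happens when a parser assumes ISO-8859-1, and it assumes the script file itself is saved as UTF-8:

```php
<?php
// Hypothetical helper (illustration only): treat every byte of $bytes as one
// ISO-8859-1 character and re-encode that character as UTF-8.
function latin1_bytes_to_utf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $char) {
        $code = ord($char);
        $out .= $code < 0x80
            ? $char                                                  // ASCII is kept as-is
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F)); // U+0080..U+00FF as two bytes
    }
    return $out;
}

$utf8    = 'à la';                        // UTF-8 bytes: c3 a0 20 6c 61
$misread = latin1_bytes_to_utf8($utf8);   // "à" is now read as two characters

echo bin2hex($utf8), "\n";     // c3a0206c61
echo bin2hex($misread), "\n";  // c383c2a0206c61
```

The byte c3 becomes U+00C3 (Atilde) and the byte a0 becomes U+00A0 (non-breaking space), which are the characters that show up in the saved document.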
So for what you are doing there, you can read on with existing material that covers this and has even more information:
In addition to the answer given in the first of the two, there is another way to do this in your case:
$saved = libxml_use_internal_errors(true);
$result = $doc->loadHTML('<?xml>' . $html);
                         ########
libxml_use_internal_errors($saved);
if ($result) {
    $doc->removeChild($doc->documentElement->previousSibling);
}
This example not only adds proper error handling and a return-value check of whether the HTML could actually be loaded, it also prefixes your string with the magic sequence "<?xml>" that will set loadHTML into UTF-8 mode. After the HTML string has been loaded as UTF-8, the DOMProcessingInstruction is removed again. The encoding will remain:
$xpath = new DOMXPath($doc);
echo $xpath->query("//div[@id='demo']")->item(0)->nodeValue;
# prints "à la téléchargez mêmes" now
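If you need this in more than one place, the steps above can be wrapped into a small function. This is only a sketch: the name load_html_utf8 is hypothetical, the extra instanceof guard and the call to libxml_clear_errors() are my additions, and whether the "<?xml>" prefix switches the parser to UTF-8 depends on the libxml version PHP is built against, so check the output on your own setup:

```php
<?php
// Hypothetical wrapper around the approach shown above (name is illustrative).
function load_html_utf8(string $html): ?DOMDocument
{
    $doc = new DOMDocument();

    $saved  = libxml_use_internal_errors(true);  // collect parser warnings instead of emitting them
    $result = $doc->loadHTML('<?xml>' . $html);  // same prefix as in the answer
    libxml_clear_errors();                       // addition: drop the collected warnings
    libxml_use_internal_errors($saved);          // restore the previous setting

    if (!$result || $doc->documentElement === null) {
        return null;                             // HTML could not be loaded
    }

    // Remove the node created for the prefix, but only if it really is a
    // processing instruction directly in front of the root element.
    $prefix = $doc->documentElement->previousSibling;
    if ($prefix instanceof DOMProcessingInstruction) {
        $doc->removeChild($prefix);
    }

    return $doc;
}

$doc = load_html_utf8('<div id="demo">à la téléchargez mêmes</div>');
if ($doc !== null) {
    $xpath = new DOMXPath($doc);
    echo $xpath->query("//div[@id='demo']")->item(0)->nodeValue, "\n";
}
```

Returning null on failure keeps the return-value check from the snippet above in one place, so callers only have to test for null.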
Find it demonstrated online here across many different PHP versions: http://3v4l.org/TT3SM