Why does an en dash in a title tag break Unicode strings in DOMDocument? This code:
<?php
$html = <<<'HTML'
<!DOCTYPE html>
<html><head>
<title>example.org – example.org - example.org</title>
<meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;
$domd = new DOMDocument("1.0", "UTF-8");
@$domd->loadHTML($html);
$xp = new DOMXPath($domd);
$interesting = $domd->getElementsByTagName("body")->item(0)->textContent;
var_dump($interesting, bin2hex($interesting));
prints garbage (mojibake):
string(14) "TrÃ¤dgÃ¥rd"
string(28) "5472c383c2a46467c383c2a57264"
However, if we just remove the en dash from the title on line 5 of the code, changing it to
<title>example.org example.org - example.org</title>
it prints
string(10) "Trädgård"
string(20) "5472c3a46467c3a57264"
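Comparing the two hex dumps, the bad one looks like plain double encoding: the UTF-8 bytes of ä (c3 a4) and å (c3 a5) seem to have been read as single-byte characters and then encoded to UTF-8 a second time. Here is a minimal sketch that reproduces the exact bad bytes without DOMDocument. It assumes the mbstring extension is available, and that ISO-8859-1 is the encoding involved, which is only my guess from the bytes:

```php
<?php
// The correct UTF-8 bytes of "Trädgård", taken from the good hex dump above.
$good = hex2bin("5472c3a46467c3a57264");

// Assumption: treat those bytes as ISO-8859-1 and convert them to UTF-8 again.
// This is a guess at what happens to the text, not a confirmed explanation.
$double = mb_convert_encoding($good, "UTF-8", "ISO-8859-1");

var_dump(strlen($double));  // int(14)
var_dump(bin2hex($double)); // string(28) "5472c383c2a46467c383c2a57264"
```

That matches the bad output byte for byte, but it only shows what the bytes look like, not why the en dash triggers it.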
So why does an en dash break Unicode strings in DOMDocument?
(It took me a long time to track down that the en dash is the cause x.x)