Why does an en dash in a title tag break Unicode strings in DOMDocument? This code

<?php
$html = <<<'HTML'
<!DOCTYPE html>
<html><head>
    <title>example.org – example.org - example.org</title>
    <meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;
$domd = new DOMDocument("1.0", "UTF-8");
@$domd->loadHTML($html);
$xp = new DOMXPath($domd);
$interesting = $domd->getElementsByTagName("body")->item(0)->textContent;
var_dump($interesting, bin2hex($interesting));

prints the nonsense

string(14) "TrÃ¤dgÃ¥rd"
string(28) "5472c383c2a46467c383c2a57264"

However, if we just remove the en dash from line 5, changing it to

    <title>example.org example.org - example.org</title>

it prints

string(10) "Trädgård"
string(20) "5472c3a46467c3a57264"

So why does an en dash break Unicode strings in DOMDocument?

(took me a long time to track down that the en-dash is the cause x.x )

hanshenrik

1 Answer

I don't know exactly why, but the key seems to be that any non-ASCII characters occurring before the UTF-8 charset declaration confuse the parser, meaning:

<!DOCTYPE html>
<html><head>
    <title>æøå</title>
    <meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>

will confuse it, while

<!DOCTYPE html>
<html><head>
    <meta charset="utf-8" />
    <title>æøå</title>
</head>
<body>Trädgård</body>
</html>

works fine. @Tino Didriksen found this quote from https://www.w3.org/International/questions/qa-html-encoding-declarations:

so it's best to put it immediately after the opening head tag.
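
For what it's worth, the broken hex dump in the question is exactly what you get when the UTF-8 bytes are read as ISO-8859-1 and then re-encoded as UTF-8. This is only a sketch reproducing the symptom, not a statement about what the parser does internally. It assumes the mbstring extension is available, and the variable names are mine:

```php
<?php
// "Trädgård" as raw UTF-8 bytes, written with escapes so the
// encoding of this source file does not matter.
$utf8 = "Tr\xC3\xA4dg\xC3\xA5rd";

// Treat those bytes as ISO-8859-1 and convert them to UTF-8,
// i.e. encode already-encoded text a second time.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump(bin2hex($utf8));          // string(20) "5472c3a46467c3a57264"
var_dump(bin2hex($doubleEncoded)); // string(28) "5472c383c2a46467c383c2a57264"
```

The second dump matches the 14-byte output from the question byte for byte, so the body text is being decoded with a single-byte encoding instead of UTF-8.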

As the top-rated comment in the loadHTML documentation mentions, a quick and dirty workaround is:

$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
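
Applied to the HTML from the question, that workaround might look like the sketch below. It assumes the file is saved as UTF-8, and it uses libxml_use_internal_errors() rather than @ to silence parser warnings, which is my choice and not part of the original snippet:

```php
<?php
// Same markup as in the question: the en dash in <title> comes
// before the <meta charset> declaration.
$html = <<<'HTML'
<!DOCTYPE html>
<html><head>
    <title>example.org – example.org - example.org</title>
    <meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Prepend an encoding hint so the parser sees it before any
// non-ASCII bytes in the document.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$body = $doc->getElementsByTagName('body')->item(0)->textContent;
var_dump($body, bin2hex($body));
```

With the hint in place, the body text should come back as the 10-byte UTF-8 string rather than the 14-byte double-encoded one. The prepended declaration also ends up in the document as a node, so you may want to remove it before saving the document back out.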
hanshenrik