55

The webserver is serving responses with utf-8 encoding, all files are saved with utf-8 encoding, and every setting I know of has been set to utf-8.

Here's a quick program, to test if the output works:

<?php
$html = <<<HTML
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;

$dom = new DOMDocument("1.0", "utf-8");
$dom->loadHTML($html);

header("Content-Type: text/html; charset=utf-8");
echo($dom->saveHTML());

The output of the program is:

<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Test!</title></head><body>
    <h1>&acirc;&#152;&#134; Hello &acirc;&#152;&#134; World &acirc;&#152;&#134;</h1>
</body></html>

Which renders as:

â˜† Hello â˜† World â˜†


What could I be doing wrong? How much more specific do I have to be to tell the DOMDocument to handle utf-8 properly?

Syscall
Greg
  • Thanks for bringing up the question, a similar one is: [How to keep the Chinese or other foreign language as they are instead of converting them into codes?](http://stackoverflow.com/q/10237238/367456) however you might consider that a hack. – hakre Jul 03 '12 at 10:55
  • Related: [PHP Request #47875 - No option to set HTML input encoding](https://bugs.php.net/bug.php?id=47875) – hakre Jul 03 '12 at 11:59
  • 1
    Strangely enough: php-documentation says: `The DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.` see: http://www.php.net/manual/en/intro.dom.php – juwens Mar 26 '13 at 18:46
  • The HTML5 meta charset encoding declaration has been supported since libxml version 2.8.0, so the code sample in the question now works as expected. – Alf Eaton Nov 09 '14 at 17:13
  • the problem is that you're specifying utf8 but `˜` etc are **not** for utf8, but ANSI. the "dagger" for instance is http://hexutf8.com/?q=e280a0 – jar Oct 05 '16 at 03:21

3 Answers

118

DOMDocument::loadHTML() expects a HTML string.

HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as the default per its specs. That has been the case for a long time, see 6.1. The HTML Document Character Set. In practice this mostly means default support for Windows-1252 in common web browsers.

I go back that far because PHP's DOMDocument is based on libxml, which brings its own HTML parser, designed for HTML 4.0.

I'd say it's safe to assume then that you can load an ISO-8859-1 encoded string.
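To make the `&acirc;&#152;&#134;` in the question's output less mysterious, here is a small sketch (my own illustration, not taken from the libxml sources) that prints the UTF-8 bytes of the star. If each byte is read as one ISO-8859-1 character, the byte values are exactly the code points seen in the output: 226 is â, followed by 152 and 134.

```php
<?php
// U+2606 (WHITE STAR) written with a Unicode escape, so the bytes are
// independent of how this file itself is saved.
$star = "\u{2606}";

// unpack('C*') returns the raw bytes as unsigned integers (1-indexed array).
$bytes = unpack('C*', $star);

foreach ($bytes as $byte) {
    // Under ISO-8859-1 every byte is one character with this code point.
    printf("0x%02X = %d\n", $byte, $byte);
}
// Prints:
// 0xE2 = 226
// 0x98 = 152
// 0x86 = 134
```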

Your string is UTF-8 encoded. Turn all characters higher than 127 (0x7F) into HTML entities and you're fine. If you don't want to do that yourself, that is what mb_convert_encoding with the HTML-ENTITIES target encoding does:

  • Characters that have a named entity get the named entity, e.g. € -> &euro;
  • The others get their numeric (decimal) entity, e.g. ☆ -> &#9734;

The following code example makes the process a bit more visible by using a callback function:

$html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function($match) {
    list($utf8) = $match;
    $entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
    printf("%s -> %s\n", $utf8, $entity);
    return $entity;
}, $html);

For your string, this outputs:

☆ -> &#9734;
☆ -> &#9734;
☆ -> &#9734;

Anyway, that's just for looking deeper into your string. What you want is to have it converted into an encoding loadHTML can deal with. That can be done by converting everything outside of US-ASCII into HTML entities:

$us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');

Take care that your input is actually UTF-8 encoded. If you have mixed encodings (that can happen with some inputs), keep in mind that mb_convert_encoding can only handle one encoding per string. I already outlined above how to do more specific string replacements with the help of regular expressions, so I'll leave further details for now.
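As a hedged sketch of the "check first, then convert" idea: the helper below is hypothetical (the name `utf8ToAsciiEntities` is mine). It validates the input with `mb_check_encoding()` and then uses `mb_encode_numericentity()` instead of the `HTML-ENTITIES` target, because that mbstring target is deprecated as of PHP 8.2. Unlike `HTML-ENTITIES`, it only emits numeric entities, so € becomes `&#8364;` rather than `&euro;`.

```php
<?php
// Hypothetical helper, not part of the original answer.
function utf8ToAsciiEntities(string $utf8): string
{
    // Refuse input that is not valid UTF-8 instead of converting garbage.
    if (!mb_check_encoding($utf8, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Conversion map: [first code point, last code point, offset, mask].
    // Everything from U+0080 to U+10FFFF becomes a decimal numeric entity.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

$result = utf8ToAsciiEntities("\u{2606} Hello \u{2606} World \u{2606}");
echo $result, "\n"; // &#9734; Hello &#9734; World &#9734;
```

The result is plain US-ASCII, so it can be passed to `loadHTML()` without the parser having to guess an encoding.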

The other alternative is to hint the encoding. This can be done in your case by modifying the document and adding a

<meta http-equiv="content-type" content="text/html; charset=utf-8">

which is a Content-Type specifying a charset. That is also best practice for HTML strings that are not served by a webserver (e.g. saved on disk or held in a string like in your example). The webserver normally sets that as the response header.

If you don't mind the warnings about misplaced elements, you can just add it in front of the string:

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);

Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document are automatically placed there. This is what happens here, too. The output (pretty-printed):

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <meta charset="utf-8">
    <title>Test!</title>
  </head>
  <body>
    <h1>☆ Hello ☆ World ☆</h1>    
  </body>
</html>
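One way to check that the hint took effect without relying on how `saveHTML()` serializes the result is to read the text back from the DOM. This is a minimal sketch under my own assumptions: the input fragment is hypothetical, and `libxml_use_internal_errors()` is used only to keep the parser warnings out of the output.

```php
<?php
// Hypothetical UTF-8 input fragment (three white stars, U+2606).
$html = "<h1>\u{2606} Hello \u{2606} World \u{2606}</h1>";

// Collect parser warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $html);

libxml_clear_errors();
libxml_use_internal_errors($previous);

// DOM strings are UTF-8, so a correctly decoded star compares equal here.
$h1 = $dom->getElementsByTagName('h1')->item(0);
var_dump($h1->textContent === "\u{2606} Hello \u{2606} World \u{2606}");
```

If the comparison fails, the parser most likely fell back to its default encoding and the stars were split into three separate characters each.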
Community
hakre
  • 2
    @hakre : that was perfect ! you solved my serious problem and now I have no headaches!! – Aliweb Nov 02 '12 at 18:18
  • 1
    +1 Great answer, but which method do you recommend -- using `mb_convert_encoding()` or prepending the meta tag in `loadHTML()`? – Nate Aug 25 '14 at 12:40
  • 1
    @Nate: I would say it depends. I normally do not recommend `mb_convert_encoding()` but for this case I somehow do. However that's a detail of personal preference. And it still depends on whether you want to do the conversion in its own step or you just want to smash that into `DOMDocument::loadHTML()`, which leaks the meta element into the document. I don't know for example what will happen if that element already existed. I have never tested that to a safe point, but it normally "just works" (tm). The different ways in the answer are more for explanation. – hakre Aug 25 '14 at 17:11
  • for anyone using the alternative method, I suggest to check DeZeA's answer below, it worked better since it did not remove classes from the html tag – Moshe Shaham Sep 19 '14 at 21:11
18

There's a faster fix for that: load your HTML with an XML encoding declaration in front of it, and after loading it into the DOMDocument, set (or rather reset) the original encoding. Here's some sample code:

$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

foreach ($dom->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
        $dom->removeChild($item);
$dom->encoding = 'UTF-8'; // reset original encoding
DeZeA
  • 1
    This worked better than hakre's version of adding the meta tag because adding the meta removed classes from the html tag – Moshe Shaham Sep 19 '14 at 21:11
  • Hmm, might be. I had the code in a txt file with a bunch of useful snippets. I don't claim it's original, though it's pretty standard use of the DOMDocument class. – DeZeA Oct 13 '16 at 13:44
11
<?php
  header("Content-type: text/html; charset=utf-8");
  $html = <<<HTML
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;

  $html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
  $dom = new DOMDocument("1.0", "utf-8");
  $dom->loadHTML($html);

  header("Content-Type: text/html; charset=utf-8");
  echo($dom->saveHTML());

Output:

<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Test!</title></head><body>
    <h1>&#9734; Hello &#9734; World &#9734;</h1>
</body></html>
Syscall
  • 2
    @powtac: This variant actually does not need that `header` line. All characters that are not part of US-ASCII are entities here. Any browser on earth will always display this properly unless you specify a (wrong) encoding that does not share US-ASCII. But just noting, it's not wrong either. – hakre Jul 03 '12 at 12:10