35

This is my code:

$oDom = new DOMDocument();
$oDom->loadHTML("èàéìòù");
echo $oDom->saveHTML();

This is the output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&Atilde;&uml;&Atilde;&nbsp;&Atilde;&copy;&Atilde;&not;&Atilde;&sup2;&Atilde;&sup1;</p></body></html>

I want this output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>èàéìòù</p></body></html>

I've tried this:

$oDom = new DomDocument('4.0', 'UTF-8');

and also 1.0 and other variations, but nothing works.

One more thing: is there a way to get the HTML back untouched? For example, given the input <p>hello!</p>, I'd like to get exactly <p>hello!</p> as output, using DOMDocument only to parse the DOM and make some substitutions inside the tags.

Michael Berkowski
Francesco Casula
  • Given that you've got `Ã` in the output, something is mangling your UTF-8 and treating it as ISO-8859 or similar. – Marc B Jul 04 '11 at 15:27
  • possible duplicate of [PHP DOMDocument loadHTML not encoding UTF-8 correctly](http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly) – cmbuckley Feb 11 '13 at 10:18
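
The mangling described in the first comment can be reproduced in isolation. This is only a sketch: it assumes the mbstring extension is available, and the variable names are illustrative. It takes the UTF-8 bytes of è, reads them as if they were ISO-8859-1, and entity-encodes the result, which gives the same `&Atilde;&uml;` pair that opens the output in the question.

```php
<?php
// "\u{E8}" is è; the escape guarantees UTF-8 bytes regardless of how this file is saved.
$original = "\u{E8}";

// Treat the two UTF-8 bytes (0xC3 0xA8) as two ISO-8859-1 characters
// and convert them to UTF-8 a second time.
$misread = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

$hex      = bin2hex($original);
$entities = htmlentities($misread, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $hex, "\n";      // c3a8
echo $entities, "\n"; // &Atilde;&uml;
```

Each two-byte character in the input therefore turns into two entities, which is why the question's output is twice as long as expected.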

10 Answers

65

Solution:

$oDom = new DOMDocument();
$oDom->encoding = 'utf-8';
$oDom->loadHTML( utf8_decode( $sString ) ); // important!

$sHtml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
$sHtml .= $oDom->saveHTML( $oDom->documentElement ); // important!

The saveHTML() method behaves differently when you pass it a node. You can pass the root node ($oDom->documentElement) and prepend the desired !DOCTYPE manually. The other important piece is utf8_decode(). In my case, none of the other attributes and methods of the DOMDocument class produced the desired result.
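
To see the difference between the two saveHTML() calls side by side, here is a minimal sketch. It makes two assumptions that are not part of the answer: the input string is a hypothetical example, and mb_convert_encoding() to ISO-8859-1 stands in for utf8_decode(), which is deprecated as of PHP 8.2. Like utf8_decode(), this conversion cannot represent characters outside ISO-8859-1.

```php
<?php
// Hypothetical input: "èàéìòù" written with escapes so the bytes are UTF-8 for certain.
$accents = "\u{E8}\u{E0}\u{E9}\u{EC}\u{F2}\u{F9}";
$sString = '<p>' . $accents . '</p>';

// Assumption: stand-in for utf8_decode($sString); requires the mbstring extension.
$latin1 = mb_convert_encoding($sString, 'ISO-8859-1', 'UTF-8');

$oDom = new DOMDocument();
$oDom->loadHTML($latin1);

// Without a node argument the whole document is serialized.
$wholeDocument = $oDom->saveHTML();

// With a node argument only that node's subtree is serialized.
$rootOnly = $oDom->saveHTML($oDom->documentElement);

echo $wholeDocument, "\n";
echo $rootOnly, "\n";
```

Comparing the two echoed strings shows whether the accented characters come back as literal text or as entities on a given PHP and libxml build.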

Francesco Casula
  • 18
    To make this work with characters outside the ISO-8859-1 set, you need to use multi-byte conversion, so that characters like Chinese or the euro sign will also be properly encoded. `$oDom->loadHTML(mb_convert_encoding($sString, 'HTML-ENTITIES', 'UTF-8'));` [see here for more info](http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly) – Andrew Killen Jul 16 '15 at 19:58
  • I almost lost my mind trying to solve this! Thank you very much! – George Henrique May 31 '21 at 18:56
7
$dom = new DomDocument();
$str = htmlentities($str);
$dom->loadHTML(utf8_decode($str));
$dom->encoding = 'utf-8';
// ...
$str = $dom->saveHTML();
$str = html_entity_decode($str);

The above code worked for me.

int_ashish
7

Try to set the encoding type after you have loaded the HTML.

$dom = new DOMDocument();
$dom->loadHTML($data);
$dom->encoding = 'utf-8';
echo $dom->saveHTML();

Other way

Community
SAIF
6

I don't know why the marked answer didn't work for my problem. But this one did.

ref: https://www.php.net/manual/en/class.domdocument.php

<?php

// check that the content we're receiving isn't empty, to avoid the warning
if (empty($content)) {
    return false;
}

// convert non-ASCII characters to HTML entities so the UTF-8 bytes are not misread
$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');

// create a new document
$doc = new DOMDocument('1.0', 'utf-8');

// collect libxml parse errors internally instead of emitting warnings
libxml_use_internal_errors(true);

// load the content without adding the enclosing html/body tags or the doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// do whatever you want to do with the document now

?>
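
As a rough illustration of what the two flags buy you, the sketch below round-trips a single-element fragment. Everything in it beyond the flags is an assumption: the fragment is made up, and mb_encode_numericentity() is used in place of the 'HTML-ENTITIES' conversion, which is deprecated as of PHP 8.2. It has only been reasoned through for a fragment with one root element; fragments with several top-level elements may be restructured by the parser.

```php
<?php
// Hypothetical fragment; the accented characters use escapes to guarantee UTF-8 bytes.
$accents  = "\u{E8}\u{E0}\u{E9}\u{EC}\u{F2}\u{F9}";
$fragment = '<p>hello, ' . $accents . '!</p>';

// Assumption: rewrite every code point from U+0080 upward as a numeric entity.
// ASCII markup is left untouched. Requires the mbstring extension.
$encoded = mb_encode_numericentity($fragment, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

// Collect parse errors internally, then restore the previous setting.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($encoded, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// With LIBXML_HTML_NOIMPLIED the fragment's own root element is the document element.
$roundTripped = trim($doc->saveHTML($doc->documentElement));

echo $roundTripped, "\n";
```

Restoring the previous libxml error setting, rather than leaving it switched on, keeps the snippet from changing error handling for the rest of the script.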
Nurkartiko
5

The issue appears to be known, according to the user comments on the manual page at php.net. Solutions suggested there include putting

<meta http-equiv="content-type" content="text/html; charset=utf-8">

in the document before you put any strings with non-ASCII chars in.

Another hack suggests putting

<?xml encoding="UTF-8">

as the first text in the document and then removing it at the end.

Nasty stuff. Smells like a bug to me.

4

This way:

/**
 * @param string $text
 * @return DOMDocument
 */
private function buildDocument($text)
{
    $dom = new DOMDocument();

    libxml_use_internal_errors(true);
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $text);
    libxml_use_internal_errors(false);

    return $dom;
}
Ivan Proskuryakov
  • 1
    I needed it for an API endpoint that a mobile app uses. And only this solution worked for me. Thanks :) – Waqas Jul 10 '19 at 12:56
3

What worked for me was:

$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));

credit: https://davidwalsh.name/domdocument-utf8-problem

STA
Pratip Ghosh
1

None of the above worked for me but this one did the job:

$fileContent = file_get_contents('my_file.html');
$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($fileContent, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->encoding = 'utf-8';
$html = $dom->saveHTML();
$html = html_entity_decode($html, ENT_COMPAT, 'UTF-8');
echo $html;
oneandonlycore
0

Looks like you just need to set substituteEntities when you create the DOMDocument object.

Quentin
0

This worked for me:

<?php

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}

?>

Credits: https://www.php.net/manual/en/domdocument.loadhtml.php#95251

joseantgv