DomDocument and special characters

Question

This is my code:

$oDom = new DOMDocument();
$oDom->loadHTML("èàéìòù");
echo $oDom->saveHTML();

This is the output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&Atilde;&uml;&Atilde;&nbsp;&Atilde;&copy;&Atilde;&not;&Atilde;&sup2;&Atilde;&sup1;</p></body></html>

I want this output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>èàéìòù</p></body></html>

I've tried

$oDom = new DomDocument('4.0', 'UTF-8');

as well as version 1.0 and other variations, but nothing worked.

One more thing: is there a way to get the same HTML back untouched? For example, given the input <p>hello!</p>, I want the output to be exactly <p>hello!</p>, using DOMDocument only to parse the DOM and make some substitutions inside the tags.

Given that you've got `Ã` in the output, something is mangling your UTF-8 and treating it as ISO-8859-1 or similar. – Marc B, Jul 04 '11 at 15:27
possible duplicate of [PHP DOMDocument loadHTML not encoding UTF-8 correctly](http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly) — cmbuckley, Feb 11 '13 at 10:18
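
To illustrate the comment above: a minimal sketch of how the UTF-8 bytes of `è` turn into `&Atilde;&uml;` when they are read as ISO-8859-1. The variable names are only for illustration, and the sketch assumes the mbstring extension is available; it reproduces the misreading by hand rather than going through DOMDocument.

```php
<?php
// "è" written as its two UTF-8 bytes, so the example does not depend on the file's encoding.
$utf8 = "\xC3\xA8";

// Treat each byte as an ISO-8859-1 character and re-encode it as UTF-8.
// 0xC3 becomes "Ã" and 0xA8 becomes "¨".
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Entity-encode the misread text, which is what shows up in the question's output.
$entities = htmlentities($misread, ENT_QUOTES, 'UTF-8');

echo bin2hex($utf8), "\n";
echo $entities, "\n";
```

The second line printed is the same `&Atilde;&uml;` pair that starts the paragraph in the question's output.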

Francesco Casula · Accepted Answer · 2014-02-17T09:38:34.090

Solution:

$oDom = new DOMDocument();
$oDom->encoding = 'utf-8';
$oDom->loadHTML( utf8_decode( $sString ) ); // important!

$sHtml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
$sHtml .= $oDom->saveHTML( $oDom->documentElement ); // important!

saveHTML() behaves differently when you pass it a node. You can pass the root node ($oDom->documentElement) and prepend the desired !DOCTYPE manually. The other important piece is utf8_decode(). In my case, none of the other attributes and methods of the DOMDocument class produced the desired result.

edited Feb 17 '14 at 09:38

answered Jul 08 '11 at 06:11

Francesco Casula

To make this work with characters outside the ISO-8859-1 set, you need to use multi-byte conversion, so that characters such as Chinese or the euro sign will also be properly encoded. `$oDom->loadHTML(mb_convert_encoding($sString, 'HTML-ENTITIES', 'UTF-8'));` [see here for more info](http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly) – Andrew Killen Jul 16 '15 at 19:58
I almost lost my mind trying to solve this! Thank you very much! – George Henrique May 31 '21 at 18:56
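
A note for newer PHP versions: as of PHP 8.2, both `utf8_decode()` and the `'HTML-ENTITIES'` target of `mb_convert_encoding()` are deprecated. A rough equivalent of the comment's approach is sketched below using `mb_encode_numericentity()`. The helper name `toNumericEntities` is hypothetical, and this is one possible replacement rather than the method used in the answer.

```php
<?php
// Hypothetical helper: replace every non-ASCII character with a numeric entity,
// leaving markup characters such as < and > untouched.
function toNumericEntities(string $html): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $map, 'UTF-8');
}

$sString = "<p>\u{E8}\u{20AC}</p>"; // "è" and the euro sign

$oDom = new DOMDocument();
$oDom->loadHTML(toNumericEntities($sString));

echo $oDom->saveHTML($oDom->documentElement), "\n";
```

Because the string handed to loadHTML() is pure ASCII, the parser's guess about the input encoding no longer matters.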

score 7 · Answer 2 · answered Feb 28 '20 at 07:34

$dom = new DomDocument();
// Note: htmlentities() also escapes <, > and &, so any markup in $str is loaded as plain text.
$str = htmlentities($str);
$dom->loadHTML(utf8_decode($str));
$dom->encoding = 'utf-8';
// ...
$str = $dom->saveHTML();
$str = html_entity_decode($str);

The above code worked for me.

answered Feb 28 '20 at 07:34

int_ashish

score 7 · Answer 3 · edited May 23 '17 at 10:30

Try setting the encoding after you have loaded the HTML.

$dom = new DOMDocument();
$dom->loadHTML($data);
$dom->encoding = 'utf-8';
echo $dom->saveHTML();

Other way

edited May 23 '17 at 10:30

Community

answered Jul 04 '11 at 15:32

SAIF

score 6 · Answer 4 · answered Oct 09 '19 at 03:38

I don't know why the marked answer didn't work for my problem. But this one did.

ref: https://www.php.net/manual/en/class.domdocument.php

<?php

// Bail out on empty content to avoid the loadHTML() warning.
// (A top-level return like this only makes sense inside a function or an included file.)
if (empty($content)) {
    return false;
}

// Convert non-ASCII UTF-8 characters to HTML entities so libxml cannot misread them.
$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');

// Create the document.
$doc = new DOMDocument('1.0', 'utf-8');

// Collect libxml parse errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

// Load the content without adding the implied html/body tags or a default doctype.
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Do whatever you want with $doc from here.

?>

score 5 · Answer 5 · answered Jul 06 '11 at 12:03

The issue appears to be known, according to the user comments on the manual page at php.net. Solutions suggested there include putting

<meta http-equiv="content-type" content="text/html; charset=utf-8">

in the document before you put any strings with non-ASCII chars in.

Another hack suggests putting

<?xml encoding="UTF-8">

as the first text in the document and then removing it at the end.

Nasty stuff. Smells like a bug to me.

score 4 · Answer 6 · answered Oct 31 '18 at 12:00

This way:

/**
 * @param string $text
 * @return DOMDocument
 */
private function buildDocument($text)
{
    $dom = new DOMDocument();

    libxml_use_internal_errors(true);
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $text);
    libxml_use_internal_errors(false);

    return $dom;
}

answered Oct 31 '18 at 12:00

Ivan Proskuryakov

I needed it for an API endpoint that a mobile app uses. And only this solution worked for me. Thanks :) – Waqas Jul 10 '19 at 12:56
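
Building on the meta-tag approach, here is a sketch aimed at the second part of the question (getting a fragment back without the added doctype and html/body wrapper). The function name `roundTripFragment` and the explicit `<body>` wrapper are assumptions made for illustration, not part of the answer. It relies on saveHTML() accepting a node, as described in the accepted answer.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment and serialize only the
// fragment's own nodes, without the doctype/html/body wrapper.
function roundTripFragment(string $fragment): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<body>' . $fragment . '</body>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Any substitutions on the DOM would go here.

    // Serialize each child of <body> individually and join the pieces.
    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }

    return $html;
}

echo roundTripFragment('<p>hello!</p>'), "\n";
```

The parser may still normalize markup (for example, bare text directly inside the body can end up wrapped in a paragraph), so "untouched" holds only for input that is already well-formed.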

score 3 · Answer 7 · edited Jul 28 '21 at 14:02

What worked for me was:

$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));

credit: https://davidwalsh.name/domdocument-utf8-problem

edited Jul 28 '21 at 14:02

STA

answered Mar 20 '20 at 07:11

Pratip Ghosh

That fixed my issue with Turkish characters. – TCS Feb 26 '22 at 21:44

score 1 · Answer 8 · answered Apr 22 '21 at 08:40

None of the above worked for me but this one did the job:

$fileContent = file_get_contents('my_file.html');
$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($fileContent, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->encoding = 'utf-8';
$html = $dom->saveHTML();
$html = html_entity_decode($html, ENT_COMPAT, 'UTF-8');
echo $html;

score 0 · Answer 9 · answered Jul 04 '11 at 15:15

Looks like you just need to set substituteEntities when you create the DOMDocument object.

answered Jul 04 '11 at 15:15

Quentin

score 0 · Answer 10 · answered Feb 28 '23 at 21:34

This worked for me:

<?php

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}

?>

Credits: https://www.php.net/manual/en/domdocument.loadhtml.php#95251