Unicode characters are automatically converted to numeric character references by DOMDocument->saveHTML()

Question

I am using the following function to get the inner HTML of an HTML string:

function DOMinnerHTML($element) 
{ 
    $innerHTML = ""; 
    $children = $element->childNodes; 
    foreach ($children as $child) 
    { 
        $tmp_dom = new DOMDocument('1.0', 'UTF-8');
        $tmp_dom->appendChild($tmp_dom->importNode($child, true)); 
        $innerHTML .= trim($tmp_dom->saveHTML()); 
    }

    return $innerHTML; 
}

My HTML string also contains Unicode characters. Here is an example of such a string:

$html = '<div>Thats True. Yes it is well defined آپ مجھے تم کہہ کر پکاریں</div>';

When I use the above function

$output = DOMinnerHTML($html);

the output is as below

$output = '<div>Thats True. Yes it is well defined 
&#1705;&#1746;&#1748;&#1587;&#1604;&#1591;&#1575;</div>';

The actual Unicode characters have been converted to numeric character references.

I have debugged the code and found that in DOMinnerHTML function before the following line

$innerHTML .= trim($tmp_dom->saveHTML());

if I echo

echo $tmp_dom->textContent;

it shows the actual Unicode characters, but after saving to $innerHTML the output contains numeric character references. Why does it do that?

Note: please don't suggest functions like html_entity_decode to convert the numeric references back to real Unicode characters, because my HTML string also contains user-formatted data that I don't want to convert.

Note: I have also tried putting

<meta http-equiv="content-type" content="text/html; charset=utf-8">

before my HTML string, but it made no difference.

related: http://stackoverflow.com/questions/6573258/domdocument-and-special-characters — Marko D, Apr 05 '13 at 17:04
Er, is there a problem? Numeric character references should still work fine. OK, they just take up a few more bytes... — bobince, Apr 07 '13 at 20:54

score 1 · Answer 1 · answered Jul 13 '13 at 05:18

I had a similar problem. After reading the above comment, and after further investigation, I found a very simple solution.

All you have to do is use html_entity_decode() to convert the output of saveHTML(), as follows:

// Create a new dom document
$dom = new DOMDocument();


// .... Do some stuff, adding nodes, ...etc.


// the html_entity_decode function will solve the unicode issue you described
$result = html_entity_decode($dom->saveHTML());

// echo your output
echo $result;

This will ensure that Unicode characters are displayed properly.
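If the markup also contains entities that must stay escaped (for example user-entered &lt; or &amp;, which is the concern raised in the question), a narrower variant is possible. The following is only a sketch and not part of the original answer: the helper name decodeNonAsciiReferences is hypothetical, and it assumes the string is UTF-8. It decodes only numeric character references for code points above the ASCII range and leaves everything else untouched.

```php
<?php
/**
 * Hypothetical helper (not from the original answer): decode only numeric
 * character references, decimal or hex, whose code point lies outside ASCII.
 * Named entities such as &lt; and ASCII references such as &#60; stay escaped.
 */
function decodeNonAsciiReferences(string $html): string
{
    $decoded = preg_replace_callback(
        '/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/',
        static function (array $m): string {
            // $m[1] holds hex digits when the hex branch matched,
            // otherwise it is empty and $m[2] holds decimal digits.
            $codePoint = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];

            // Leave ASCII references alone so escaped markup stays escaped.
            if ($codePoint < 0x80) {
                return $m[0];
            }

            // Decode this single reference to its UTF-8 character.
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $decoded ?? $html;
}
```

It would be applied to the serialized string, for example decodeNonAsciiReferences($dom->saveHTML()), in place of the blanket html_entity_decode() call.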

score 0 · Answer 2 · answered Apr 05 '13 at 17:08

Good question, and you did an excellent job narrowing down the problem to a single line of code that caused things to go haywire! This allowed me to figure out what is going wrong.

The problem is with DOMDocument's saveHTML() function. It is doing exactly what it is supposed to do, but its design does not match what you want.

saveHTML() converts the document into a string "using HTML formatting", which means that it does HTML entity encoding for you. Sadly, this is not what you wanted. Comments in the PHP docs also indicate that DOMDocument does not handle UTF-8 especially well and does not do very well with fragments (as it automatically adds html, doctype, etc.).
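One thing that may be worth trying before switching classes: since PHP 5.3.6, saveHTML() accepts an optional node argument, and on the PHP/libxml2 builds I am aware of, serializing a single node this way tends to leave UTF-8 text as literal characters instead of numeric references. Treat that as an assumption to verify on your own setup. The sketch below rewrites the inner-HTML helper on that basis; the function name innerHtmlOf is hypothetical.

```php
<?php
/**
 * Hypothetical replacement for DOMinnerHTML(): serialize each child node
 * through the owning document instead of copying it into a temporary
 * DOMDocument and dumping the whole document.
 */
function innerHtmlOf(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // Passing a node makes saveHTML() dump only that subtree,
        // without adding a DOCTYPE or html/body wrappers.
        $html .= $element->ownerDocument->saveHTML($child);
    }

    return trim($html);
}

// Build a small tree through the DOM API (strings are expected to be UTF-8).
$doc = new DOMDocument('1.0', 'UTF-8');
$div = $doc->createElement('div');
$doc->appendChild($div);

$p = $doc->createElement('p');
$p->appendChild($doc->createTextNode('آپ مجھے تم کہہ کر پکاریں'));
$div->appendChild($p);

$inner = innerHtmlOf($div);
```

Note that the helper takes a DOM node, not a raw HTML string, so the string has to be loaded into a DOMDocument first.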

Check out this comment, which proposes a solution that simply uses another class as an alternative to DOMDocument:

After seeing many complaints about certain DOMDocument shortcomings, such as bad handling of encodings and always saving HTML fragments with <html>, <body>, and DOCTYPE, I decided that a better solution is needed.

So here it is: SmartDOMDocument. You can find it at http://beerpla.net/projects/smartdomdocument/

Currently, the main highlights are:

SmartDOMDocument inherits from DOMDocument, so it's very easy to use - just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).

saveHTMLExact() - DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off). Thus, when you call $doc->saveHTML(), your newly saved content now has <html>, <body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem). SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want - it saves HTML without adding that extra garbage that DOMDocument does.

encoding fix - DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output. SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you - just use loadHTML() as you would normally.

Thanks for your detailed answer. You have understood my problem in depth. However, I downloaded the class recommended in this answer and used it, and the same problem exists. Even the testHTML() function of the SmartDOMDocument class shows that it does not output the actual Unicode characters; it outputs the equivalent numeric HTML codes. That is my actual problem. Waiting for a solution. — Munib, Apr 05 '13 at 17:41

score 0 · Answer 3 · answered Aug 23 '15 at 06:57

mb_convert_encoding($html,'HTML-ENTITIES','UTF-8');

This worked for me
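The answer does not say where this call goes. One common reading, which is an assumption here, is that it is applied to the input before loadHTML(), so that the parser only ever sees ASCII and cannot misdetect the encoding. As far as I know, the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2, so the sketch below uses mb_encode_numericentity() from the same mbstring extension instead; the function name loadUtf8Html is hypothetical.

```php
<?php
/**
 * Hypothetical loader: escape every non-ASCII code point as a decimal
 * numeric reference before parsing, so the parser receives pure ASCII.
 * Requires the mbstring extension.
 */
function loadUtf8Html(string $html): DOMDocument
{
    // Map format: [first code point, last code point, offset, mask].
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($ascii);

    return $doc;
}
```

This only affects how the text gets into the DOM. What saveHTML() writes back out is a separate matter.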

user5256642
