4

DOMDocument seems to convert Chinese characters into codes, for instance,

你的乱发 will become ä½ çš„ä¹±å‘

How can I keep the Chinese or other foreign language as they are instead of converting them into codes?

Below is my simple test,

$dom = new DOMDocument();
$dom->loadHTML($html);

If I add this below before loadHTML(),

$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8"); 

I get,

&#20320;&#30340;&#20081;&#21457;

Even though the converted codes will be displayed as Chinese characters, &#20320;&#30340;&#20081;&#21457; is still not the 你的乱发 that I am after....

hakre
Run
  • 4
    That's not "converting them into codes," that's "breaking the encoding." What is the encoding of the original data? Are you sure the file is saved as UTF-8? – Matt Ball Apr 19 '12 at 21:43
  • The characters are displayed in ASCII instead of UTF-8. Do you have a charset meta tag in the head section of your html file? – BertR Apr 20 '12 at 07:22
  • Yes, I have it in the head section of my html file. But I found a way to get around this issue anyway. Thanks. – Run Apr 20 '12 at 15:38
  • @lauthiamkok: Please add your solution / workaround as an answer below and accept it. Your question is still marked as not-solved albeit it has a solution. Please help us making this site better. Thank you! – hakre May 31 '12 at 13:47

3 Answers

8

DOMDocument seems to convert Chinese characters into codes [...]. How can I keep the Chinese or other foreign language as they are instead of converting them into codes?

$dom = new DOMDocument();
$dom->loadHTML($html);

You're using the loadHTML function to load an HTML chunk. By default, DOMDocument expects that string to be in HTML's default encoding (ISO-8859-1). However, most often the charset (sic!) is meta-information provided alongside the string you're using, not inside it. To make this more complicated, that meta-information may even be inside the string.

Anyway as you have not shared the string data of the HTML and you have not specified the encoding, it's hard to tell specifically what is going on.

I assume the HTML is UTF-8 encoded but this is not signalled within the HTML string. So the following work-around can help:

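$doc = new DOMDocument();
// Prepend an encoding hint so the parser treats the input as UTF-8
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix: remove the injected processing instruction again
foreach ($doc->childNodes as $item) {
    if ($item->nodeType === XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
        break; // only one hint was injected; stop before iterating a modified list
    }
}
$doc->encoding = 'UTF-8'; // insert proper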

It injects an encoding hint at the very beginning (and removes it after the HTML has been loaded). From that point on, DOMDocument will return UTF-8 (as always).
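To check that the workaround behaves as described, the snippet can be wrapped in a small helper and run on a sample string. This is only a sketch: `loadUtf8Html` is a hypothetical name, not part of DOMDocument, and the input is assumed to be UTF-8 without any charset declaration of its own.

```php
<?php
// Hypothetical helper (not a DOMDocument method): loads a UTF-8 HTML
// string using the encoding-hint workaround from the answer above.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Snapshot the live child list before removing nodes from it.
    foreach (iterator_to_array($doc->childNodes) as $item) {
        if ($item->nodeType === XML_PI_NODE) {
            $doc->removeChild($item); // drop the injected hint
        }
    }
    $doc->encoding = 'UTF-8';

    return $doc;
}

$doc  = loadUtf8Html('<p>你的乱发</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// True when the characters survived parsing unchanged.
var_dump($text === '你的乱发');
```

Reading `textContent` (or `nodeValue`) gives the text as UTF-8, so the comparison against the original literal is a quick way to confirm nothing was re-encoded along the way.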

hakre
  • 1
    Also works with `` which might be more HTML friendly. A more in-depth discussion is in a [similar answer to question "PHP DomDocument failing to handle utf-8 characters (☆)"](http://stackoverflow.com/a/11310258/367456). – hakre Jul 03 '12 at 12:03
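The inline snippet in the comment above did not survive on this page. As an assumption about what an "HTML friendly" hint could look like, the sketch below prepends a `Content-Type` meta tag instead of the XML-style hint; the exact tag is a guess, not a quote from the comment.

```php
<?php
// Assumed HTML-style encoding hint (the comment's own snippet is missing).
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$html = '<p>你的乱发</p>'; // sample UTF-8 input without a charset declaration

$doc = new DOMDocument();

// Collect parser warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($hint . $html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// True when the characters survived parsing unchanged.
var_dump($text === '你的乱发');
```

Unlike the processing-instruction hint, the meta element stays in the parsed document, so it may need to be removed before serializing if it is not wanted in the output.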
2

I just stumbled upon this thread while searching for a solution to a similar problem. After loading the HTML properly and doing some parsing with XPath etc., my text ends up like this:

&#20320;&#30340;&#20081;&#21457;

This displays fine in the body of the HTML, but won't display properly in a style or script tag (e.g. when setting Chinese fonts).

To fix this, do the reverse of what lauthiamkok did:

$html = mb_convert_encoding($html, "UTF-8", "HTML-ENTITIES");

If for any reason the first workaround doesn't work for you, try this conversion.
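On newer PHP versions the `HTML-ENTITIES` pseudo-encoding of mbstring is deprecated (as of PHP 8.2), so the conversion above may emit a deprecation notice. A possible alternative, sketched here under the assumption that the string is a piece of extracted text rather than a whole document, is `html_entity_decode()`:

```php
<?php
// Text as it came out of the DOM in the answer above.
$encoded = '&#20320;&#30340;&#20081;&#21457;';

// Decode numeric and named entities into UTF-8 characters.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```

Note that `html_entity_decode()` also decodes markup-significant entities such as `&lt;` and `&amp;`, so running it over a full HTML document can change the markup. It is safer on individual text values.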

Suau
0

I'm pretty sure ä½ çš„ä¹±å‘ is actually Windows Latin 1 (not ASCII, there are no diacritics in ASCII). Somewhere along the way your UTF-8 text got saved as Windows Latin 1....
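The effect can be reproduced by deliberately treating UTF-8 bytes as a single-byte encoding. The sketch below uses ISO-8859-1 as a stand-in for Windows Latin 1, because every byte value is defined there and the round trip is lossless; the two encodings differ in the 0x80 to 0x9F range, so the garbled text will not match the one in the question character for character. It assumes the mbstring extension is available.

```php
<?php
$original = '你的乱发'; // UTF-8 source text, 4 characters in 12 bytes

// Misread each UTF-8 byte as one ISO-8859-1 character and
// re-encode the result as UTF-8: this produces the mojibake.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo the damage by converting back to the single-byte encoding,
// which yields the original UTF-8 byte sequence again.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

// True when the round trip recovered the original text.
var_dump($restored === $original);
```

If a string that looks garbled can be restored this way, the data itself is intact and was only decoded with the wrong charset somewhere along the way.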

dda