DOMDocument with XPath Encoding

Question

$convertedhtml = urlencode(mb_convert_encoding($htmlcode,'UTF-8',"auto"));
$doc = new DOMDocument();
$doc->loadHTML($convertedhtml);

$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3/text()");
$elements->item(0)->nodeValue;

return ($elements->item(0)->nodeValue);

The website is in GBK encoding. If I convert the HTML, nothing is shown at all, but if I don't convert it, the characters come out wrong.

Any ideas? From what I know, mb_* doesn't support GBK?

http://stackoverflow.com/questions/3265824/php-utf-8-to-gb2312 — Rikesh, Jul 03 '13 at 12:29

score 1 · Accepted Answer · edited May 23 '17 at 12:09

The DOMDocument::loadHTML() method does not expect a UTF-8 encoded string. In that respect it is an exception among the methods of the DOM extension, most of which do expect UTF-8 encoded strings. The same applies, by the way, to all methods of the DOM extension that load XML/HTML data from a file, a remote location, or a string: they follow different and more complex rules for the encoding of the input.

Encoding for DOMDocument::loadHTML():

If the HTML string you pass in does not contain any hint about its encoding (e.g. inside meta tags), then the string must be encoded as Latin-1.

If the string does contain an encoding hint, then it must actually be in that hinted encoding, and that encoding must be one of the supported encodings.

Notes:

I'm not aware of whether a list of supported encodings exists.
As you don't show the HTML you load, I can't say whether it contains a hint about the encoding.
I'm not aware of whether a list of all supported ways to hint the encoding within HTML for DOMDocument::loadHTML() exists.

However, for an example of how to load an HTML document or fragment in a specific encoding, see this related answer of mine:

PHP DomDocument failing to handle utf-8 characters (☆)

It most likely will show you how you can load your HTML. It also explains this in more detail. Let me know if it doesn't solve your issue.
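One way to apply the rules above to a GBK page, as a minimal sketch rather than a definitive solution: convert the bytes to UTF-8 and prepend an explicit encoding hint before calling loadHTML(). The helper name loadHtmlFromEncoding() and the sample input are made up for illustration. It assumes mbstring is available and that 'CP936' (mbstring's name for the GBK code page) matches the page. It also assumes the page does not carry a conflicting charset declaration of its own; if it does, which hint wins is something to verify.

```php
<?php
// Hypothetical helper, not part of the DOM extension.
function loadHtmlFromEncoding(string $html, string $sourceEncoding): DOMDocument
{
    // Re-encode the raw bytes as UTF-8 (assumes mbstring supports $sourceEncoding).
    $utf8 = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);

    // Explicit hint so the parser does not have to guess the encoding.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($hint . $utf8);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Sample input standing in for the fetched page: two GBK-encoded characters.
$htmlcode = "<div id=\"detail\"><div><h3>\xD6\xD0\xCE\xC4</h3></div></div>";

$doc = loadHtmlFromEncoding($htmlcode, 'CP936');
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3/text()");

// nodeValue is returned as UTF-8, like most strings coming out of the DOM extension.
$title = $elements->length > 0 ? $elements->item(0)->nodeValue : null;
```

Note that the prepended meta element becomes part of the loaded document, and that the urlencode() call from the question is left out because loadHTML() needs the markup itself, not a URL-encoded string.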
