PHP Problems with multibyte strings and using DOMDOCUMENT

Question

Using PHP 5.6.11, I have a block of HTML that is UTF-8 encoded. The multibyte characters appear in the text as raw UTF-8 bytes, not as entities.

Here is one sample of a string:

"You haven’t added"

Viewed with hexdump (see e2 80 99?) on a utf-8 console (linux):

00000000  59 6f 75 20 68 61 76 65  6e e2 80 99 74 20 61 64  |You haven...t ad|

Here it is as html entities:

"You haven&rsquo;t added"

All this is OK. However, when I load it into a DOMDocument and save it again, it comes out mangled (shown here as HTML entities):

"You haven&amp;acirc;&amp;#128;&amp;#153;t added"

Here is the code to generate this snippet.

$text = "<html><body>You haven’t added anything.<br></body></html>";
echo mb_detect_encoding($text) . "\n";

// Show the original bytes: 20 bytes starting at "You haven".
$text2 = substr($text, strpos($text, "You haven"), 20);
echo $text2 . "\n";
echo htmlentities($text2);

// Round-trip the same HTML through DOMDocument.
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML($text);
$text2 = $doc->saveHTML();
$text2 = substr($text2, strpos($text2, "You haven"), 35);
echo "\n" . htmlentities($text2) . "\n";

The output of this is:

UTF-8
You haven’t added 
You haven&rsquo;t added 
You haven&amp;acirc;&amp;#128;&amp;#153;t added
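
The mangled sequence is what the three bytes e2 80 99 look like when each byte is read as its own single-byte character instead of as one UTF-8 sequence. A minimal sketch of that misreading, with no DOMDocument involved; it assumes the mbstring extension is loaded, and the variable names are only for illustration:

```php
<?php
// U+2019 (right single quotation mark) written as its three UTF-8 bytes.
$quote = "\xE2\x80\x99";

// Treat each byte as a separate ISO-8859-1 character and re-encode as UTF-8.
$misread = mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');

// Write every non-ASCII code point as a decimal numeric entity.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$entities = mb_encode_numericentity($misread, $map, 'UTF-8');

echo bin2hex($quote), "\n";
echo $entities, "\n";
```

The second line printed is `&#226;&#128;&#153;`. `&#226;` is the numeric form of `&acirc;`, and 226, 128, 153 are e2, 80, 99 in decimal, which matches the mangled output above.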

I have tried a variety of ideas, but I can't seem to keep domdoc from mangling either the HTML or the multibyte. Any suggestions?

Edit: If I insert a meta tag it works more as expected.

$text='<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head><body>You haven’t added anything.<br></body></html>';

Output:

UTF-8
You haven’t added 
You haven&rsquo;t added 
You haven&rsquo;t added anything.&lt;br&gt;&lt;/
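
One way to check the result without going through substr and htmlentities is to read the text back from the DOM. A minimal sketch, assuming the same markup as in the edit above; `$html` and `$body` are illustrative names:

```php
<?php
// Same document as in the edit, with the charset declared in a meta tag.
// "\xE2\x80\x99" is the UTF-8 encoding of the right single quote.
$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head>'
      . "<body>You haven\xE2\x80\x99t added anything.<br></body></html>";

$doc = new DOMDocument();
$doc->loadHTML($html);

// The DOM extension hands strings back as UTF-8.
$body = $doc->getElementsByTagName('body')->item(0)->textContent;

echo $body, "\n";
// Hex of the three bytes right after "You haven".
echo bin2hex(substr($body, 9, 3)), "\n";
```

If the quote survived the round trip, the hex line shows the same e2 80 99 bytes as the hexdump of the original string.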

Edit 2:

Inserting the meta tag with charset=utf-8 works fine, and so does converting the input to HTML entities before loading it:

$doc->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'));

That fixes the encoding. I still can't figure out what DOMDocument is doing with the encoding. I tried this line at least three times earlier and it wasn't working. Perhaps a little time away from the keyboard was needed, because it seems to be working now. I'll update this if there is a problem once I test it on bigger datasets.
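
On PHP 8.2 and later, passing 'HTML-ENTITIES' to mb_convert_encoding raises a deprecation notice. A hedged sketch of the same idea using mb_encode_numericentity, which turns every non-ASCII character into a numeric entity so the parser never has to guess the encoding. The helper name `loadUtf8Html` is hypothetical, and this is one possible approach, not a definitive fix:

```php
<?php
// Hypothetical helper: load a UTF-8 HTML string that carries no charset declaration.
function loadUtf8Html($html)
{
    // Convert every code point from U+0080 upward to a decimal numeric entity;
    // ASCII, including the markup itself, is left untouched.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    return $doc;
}

$text = "<html><body>You haven\xE2\x80\x99t added anything.<br></body></html>";
$doc = loadUtf8Html($text);

// Read the text back from the DOM to confirm the quote was parsed as one character.
$body = $doc->getElementsByTagName('body')->item(0)->textContent;

echo $body, "\n";
```

A document loaded this way has no declared encoding, so saveHTML() may still write non-ASCII characters back out as entities. That is harmless in a browser but worth knowing when comparing strings.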

First off, the last `htmlentities` is throwing off your example; by that point `$text2` is already HTML encoded so you're double encoding the output. Second, what happens if you `$doc->loadHTML(htmlentities($text))`? — meustrus, Jul 08 '16 at 19:28
@meustrus I just reused $text2 but it is redefined based on $text, so it's not double encoded. See the edit for more information too. I would assume the DOM parsing would be confused if everything was switched to entities, but I haven't tried it yet. — user6972, Jul 08 '16 at 19:33
Possible duplicate of [PHP DOMDocument loadHTML not encoding UTF-8 correctly](http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly) — meustrus, Jul 08 '16 at 20:07
->saveHTML doesn't insert entities though. The problem is related to the multibyte characters. I'm not sure what domdoc is doing, but using $doc->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8')); seems to be working now, in spite of me trying that several times this morning after searching for solutions. — user6972, Jul 08 '16 at 22:48
This part is double-encoded: `&amp;acirc;`. What came out of `$doc->saveHTML` would be `&acirc;`. Granted that still isn't right, but I'd guess it reflects the 3 byte UTF-8 character. Anyway if you found something that works, answer your own question so future readers can find it more easily. I also suggest you look at the existing question flagged above for more info. Good luck; multibyte in PHP is weird and hard. — meustrus, Jul 10 '16 at 01:57
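
The double-encoding point raised in the comments can be seen in isolation. A small sketch, assuming `$saved` holds the fragment that saveHTML() produced in the first run; the value is copied from the output in the question with the extra `&amp;` undone, not regenerated here:

```php
<?php
// Fragment as saveHTML() is presumed to have returned it in the first run.
$saved = 'You haven&acirc;&#128;&#153;t added';

// Passing it through htmlentities() again escapes each "&" as "&amp;".
$shown = htmlentities($saved);

echo $saved, "\n";
echo $shown, "\n";
```

So the `&amp;` prefixes come from the final htmlentities() call used for display, while the `&acirc;&#128;&#153;` part is the actual mangling done during loading.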
