Using PHP 5.6.11, I have a block of HTML that is UTF-8 encoded and contains multibyte characters.
Here is one sample of a string:
"You haven’t added"
Viewed with hexdump on a UTF-8 console (Linux); note the bytes e2 80 99:
00000000 59 6f 75 20 68 61 76 65 6e e2 80 99 74 20 61 64 |You haven...t ad|
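The same bytes can be checked from PHP itself. This is a minimal sketch, assuming the source file is saved as UTF-8; the variable name `$sample` is only for illustration.

```php
<?php
// Assumes this file is saved as UTF-8, so ’ (U+2019) occupies three bytes.
$sample = "You haven’t added";

// bin2hex() shows the raw bytes, much like hexdump does.
echo bin2hex($sample) . "\n";

// strlen() counts bytes, mb_strlen() counts characters.
echo strlen($sample) . " bytes, " . mb_strlen($sample, 'UTF-8') . " characters\n";
```

If the hex output contains `e28099`, the string really is UTF-8 before DOMDocument sees it.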
Here it is as HTML entities:
"You haven&rsquo;t added"
All of this is fine. However, when I load it into a DOMDocument and save it again, it comes out mangled (shown as HTML entities):
"You haven&acirc;&#128;&#153;t added"
Here is the code to generate this snippet.
// UTF-8 HTML with no charset declaration
$text = "<html><body>You haven’t added anything.<br></body></html>";
echo mb_detect_encoding($text) . "\n";
// Show the string and its entity form before DOMDocument touches it
$text2 = substr($text, strpos($text, "You haven"), 20);
echo $text2 . "\n";
echo htmlentities($text2);
// Round-trip the HTML through DOMDocument
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML($text);
$text2 = $doc->saveHTML();
$text2 = substr($text2, strpos($text2, "You haven"), 35);
echo "\n" . htmlentities($text2) . "\n";
The output of this is:
UTF-8
You haven’t added
You haven&rsquo;t added
You haven&acirc;&#128;&#153;t added
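The mangled entities are what you would get if each of the three UTF-8 bytes were read as a separate ISO-8859-1 character. That seems to be what the underlying HTML parser falls back to when no charset is declared, but treat that as an assumption; the output above does not prove it. The sketch below reproduces the misreading by hand, without DOMDocument. The variable names are illustrative.

```php
<?php
// The right single quote ’ (U+2019) as raw UTF-8 bytes.
$quote = "\xE2\x80\x99";

// Reinterpret each byte as one ISO-8859-1 character and re-encode as UTF-8.
$misread = mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');

// Convert to UCS-4BE so every character is one 32-bit integer, then unpack.
$codepoints = array_values(
    unpack('N*', mb_convert_encoding($misread, 'UCS-4BE', 'UTF-8'))
);

// 226, 128 and 153 correspond to &acirc;, &#128; and &#153;.
foreach ($codepoints as $cp) {
    echo $cp . "\n";
}
```

One character has become three, and their code points line up with the entities in the output.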
I have tried a variety of ideas, but I can't seem to keep DOMDocument from mangling either the HTML or the multibyte characters. Any suggestions?
Edit: If I insert a meta tag that declares the charset, it works as expected.
$text='<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head><body>You haven’t added anything.<br></body></html>';
Output:
UTF-8
You haven’t added
You haven&rsquo;t added
You haven&rsquo;t added anything.&lt;br&gt;&lt;/
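The effect of the meta tag can also be checked without going through saveHTML() and htmlentities(). The sketch below parses the same body with and without the charset declaration and dumps the text DOMDocument holds as hex. The helper name `bodyText` is hypothetical, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns the body text as DOMDocument parsed it.
function bodyText($html)
{
    $doc = new DOMDocument();
    // Collect parser warnings quietly instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // textContent is always returned as UTF-8.
    return $doc->getElementsByTagName('body')->item(0)->textContent;
}

$body = 'You haven’t added anything.<br>';
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>';

$without = bodyText("<html><body>$body</body></html>");
$with    = bodyText("<html><head>$meta</head><body>$body</body></html>");

// bin2hex() makes the difference visible regardless of console encoding.
echo bin2hex($without) . "\n";
echo bin2hex($with) . "\n";
```

With the meta tag, the hex dump should contain `e28099`, meaning the quote survived parsing as a single character.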
Edit 2:
Inserting the meta tag with charset=utf-8 works, and so does converting the input to HTML entities before loading it:
$doc->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'));
This fixes the encoding. I still can't figure out what DOMDocument is doing with the encoding. I tried this line at least three times earlier and it wasn't working, but it seems to work now; perhaps a little time away from the keyboard was needed. I'll update this if a problem turns up once I test it on bigger datasets.
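A plausible reason the entity conversion helps is that it leaves only ASCII bytes in the input, so the parser's charset guess no longer matters. The sketch below shows the same idea with a different function, `mb_encode_numericentity()`, since the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. This is a swapped-in technique, not the line used above, and the variable names are illustrative.

```php
<?php
// Assumes this file is saved as UTF-8.
$text = "<html><body>You haven’t added anything.<br></body></html>";

// Convert map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
// ASCII characters such as < and > are left untouched.
$map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
$ascii = mb_encode_numericentity($text, $map, 'UTF-8');
echo $ascii . "\n";

// The input is now pure ASCII, so no charset declaration is needed.
$doc = new DOMDocument();
$doc->loadHTML($ascii);
$parsed = $doc->getElementsByTagName('body')->item(0)->textContent;

// The parsed text should again contain the bytes e2 80 99.
echo bin2hex($parsed) . "\n";
```

Unlike the meta tag approach, this does not touch the document structure, at the cost of rewriting every non-ASCII character in the input.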