I had a similar problem, but I could not hard-code UTF-16LE
because the input charset can vary. In the end I detect UTF-8
as follows:
if (!preg_match('~~u', $html)) {
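To spell out why that check works, here is a minimal sketch. The helper name is_valid_utf8() is made up for illustration and is not part of my code. With the u modifier, PCRE validates the subject as UTF-8 before matching, so an empty pattern matches any valid string, while invalid UTF-8 makes preg_match() return false:

```php
<?php
// Hypothetical helper name, used here only for illustration.
function is_valid_utf8(string $s): bool {
    // The "u" modifier makes PCRE validate the subject as UTF-8 first.
    // The empty pattern matches every valid string (preg_match() returns 1);
    // on invalid UTF-8 preg_match() returns false instead.
    return preg_match('~~u', $s) === 1;
}

var_dump(is_valid_utf8("h\xC3\xA9llo"));        // valid UTF-8 ("héllo")
var_dump(is_valid_utf8("\xFF\xFEh\x00i\x00"));  // UTF-16LE with BOM: 0xFF never occurs in UTF-8
```

Note that a UTF-16 or UTF-32 BOM always contains 0xFF or 0xFE, which can never appear in UTF-8, so BOM-prefixed input in those encodings should always fail this check.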
If that check fails (the input is not valid UTF-8), I determine the actual encoding from the BOM:
function detect_bom_encoding($str) {
    // strncmp() compares raw bytes and is safe for strings shorter than the BOM.
    if (strncmp($str, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    else if (strncmp($str, "\x00\x00\xFE\xFF", 4) === 0) {
        return 'UTF-32BE';
    }
    else if (strncmp($str, "\xFF\xFE", 2) === 0) {
        // The UTF-32LE BOM starts with the UTF-16LE BOM, so test the longer one first.
        if (strncmp($str, "\xFF\xFE\x00\x00", 4) === 0) {
            return 'UTF-32LE';
        }
        return 'UTF-16LE';
    }
    else if (strncmp($str, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    // No BOM found.
    return null;
}
Now I can use iconv(), as shown in @carpetsmoker's answer:
$html = iconv(detect_bom_encoding($html), 'UTF-8', $html);
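Putting the two steps together, the whole flow might look roughly like the sketch below. The names bom_info() and to_utf8() are hypothetical, and the exceptions are my own choice for the "no BOM, not UTF-8" and "conversion failed" cases. One deliberate difference: the sketch strips the BOM bytes itself before calling iconv(), on the assumption that BOM handling may vary between iconv implementations.

```php
<?php
// Sketch only: bom_info() and to_utf8() are hypothetical names for illustration.

// Returns [encoding, BOM length in bytes], or null when no known BOM is present.
function bom_info(string $s): ?array {
    // 4-byte BOMs come first because the UTF-32LE BOM begins with the UTF-16LE BOM.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($s, $bom, strlen($bom)) === 0) {
            return [$encoding, strlen($bom)];
        }
    }
    return null;
}

function to_utf8(string $s): string {
    // Step 1: input that is already valid UTF-8 only loses an optional UTF-8 BOM.
    if (preg_match('~~u', $s) === 1) {
        return strncmp($s, "\xEF\xBB\xBF", 3) === 0 ? substr($s, 3) : $s;
    }
    // Step 2: otherwise rely on the BOM to name the encoding.
    $info = bom_info($s);
    if ($info === null) {
        throw new InvalidArgumentException('No BOM found and input is not valid UTF-8');
    }
    [$encoding, $bomLength] = $info;
    // Step 3: drop the BOM explicitly, then convert the remaining bytes.
    $converted = iconv($encoding, 'UTF-8', substr($s, $bomLength));
    if ($converted === false) {
        throw new RuntimeException("iconv() could not convert from $encoding");
    }
    return $converted;
}

echo to_utf8("\xFF\xFEh\x00i\x00"), "\n"; // UTF-16LE input with BOM
```

Input with neither a BOM nor valid UTF-8 raises an exception here instead of being guessed at; whether that is the right policy depends on where your data comes from.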
I did not use mb_convert_encoding() because it did not remove the BOM
(and, unlike iconv(), did not convert the line breaks):
