mb_detect_encoding() discrepancy for non latin1 characters

Question

I'm using the mb_detect_encoding() function to check whether a string contains non-Latin-1 (ISO-8859-1) characters.

Since Japanese isn't part of Latin-1, I'm using it as the text of the test string. Yet when the string is passed to the function, it seems to accept it as ISO-8859-1. Example code:

$str = "これは日本語のテキストです。読めますか";
// Strict mode, with ISO-8859-1 as the only candidate encoding
$res = mb_detect_encoding($str, "ISO-8859-1", true);

print $res;

I've tried using 'ASCII' instead of 'ISO-8859-1', which correctly returns false. Is anyone able to explain the discrepancy?

score 0 · Accepted Answer · answered Mar 21 '11 at 23:23

I wanted to be funny and say hexdump could explain it:

0000000 81e3 e393 8c82 81e3 e6af a597 9ce6 e8ac
0000010 9eaa 81e3 e3ae 8683 82e3 e3ad b982 83e3
0000020 e388 a781 81e3 e399 8280 aae8 e3ad 8182
0000030 81e3 e3be 9981 81e3 0a8b

But alas, that's quite the opposite.

In ISO-8859-1, practically only the bytes \x80-\x9F are invalid. But many of the bytes in the UTF-8 representation of your Japanese characters fall in exactly that range.
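To make the byte overlap concrete, here is a minimal sketch that inspects the first character of the test string. The character is written as explicit byte escapes so the result does not depend on how the source file is saved; the variable names are only illustrative.

```php
<?php
// "こ" (U+3053) as its three UTF-8 bytes, matching the start of the hexdump.
$char = "\xE3\x81\x93";

// bin2hex() shows the raw bytes as a hex string.
$hex = bin2hex($char);

// Without the /u modifier, PCRE matches byte by byte, so this counts
// the bytes that land in the \x80-\x9F range.
$count = preg_match_all('/[\x80-\x9F]/', $char, $matches);

echo $hex, "\n";   // e38193
echo $count, "\n"; // 2
```

Two of the three bytes (\x81 and \x93) sit in the range that ISO-8859-1 leaves unassigned; the lead byte \xE3 does not.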

Anyway, mb_detect_encoding() uses heuristics, and they fail in this example. My conjecture is that it mistakes ISO-8859-1 for ISO-8859-15 or, worse, for CP1251, the incompatible Windows charset, which allows said code points.

I would suggest a workaround: test it yourself. The only check that tells you a byte in a string is certainly not a Latin-1 character is:

preg_match('/[\x7F-\x9F]/', $str);
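
As a rough sketch, the check can be wrapped in a small helper. The function name hasNonLatin1Bytes() is hypothetical, not part of PHP or mbstring, and the sample strings are written as byte escapes so the file encoding does not matter.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns true if $str contains
// at least one byte in \x7F-\x9F, which has no printable Latin-1 character.
function hasNonLatin1Bytes(string $str): bool
{
    // No /u modifier, so the pattern is applied to raw bytes.
    return preg_match('/[\x7F-\x9F]/', $str) === 1;
}

$japanese = "\xE3\x81\x93\xE3\x82\x8C"; // "これ" as UTF-8 bytes
$latin1   = "caf\xE9";                  // "café" encoded as ISO-8859-1

var_dump(hasNonLatin1Bytes($japanese)); // bool(true)
var_dump(hasNonLatin1Bytes($latin1));   // bool(false)
```

The check only works in one direction. A match means the string is certainly not plain Latin-1 text, but no match proves nothing: the UTF-8 encoding of "é" is \xC3\xA9, and both bytes are valid Latin-1 characters.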

I'm linking to the German Wikipedia, because their article shows the differences best: http://de.wikipedia.org/wiki/ISO_8859-1

answered Mar 21 '11 at 23:23

mario

All ISO-8859 encodings forbid \x80-\x9F, so that would be CP125x – ninjalj Mar 21 '11 at 23:34
Thanks for the info. I can make do with an ASCII check, so it shouldn't cause me any major problems. But I sure wish PHP was more consistent with this multibyte character stuff =( – Spoonface Mar 21 '11 at 23:50
FWIW, this is a problem with character sets in general, not with PHP. – Charles Mar 22 '11 at 00:10
@Charles: Indeed. More specifically, it just turns out to be an issue in some part of the mbstring extension. I can't find it, but I have a gut feeling it's related to e.g. mbfilter_cp1251.c, where a comment says `/* all of this is so ugly now! */` – mario Mar 22 '11 at 00:19
