I'm trying to understand the logic of the two functions mb_detect_encoding and mb_check_encoding, but the documentation is poor. Starting with a very simple test string
$string = "\x65\x92";
which is a lowercase 'e' (0x65) followed by a right curly quote mark (0x92) when using Windows-1252 encoding, I get the following results:
mb_detect_encoding($string,"Windows-1252"); // false
mb_check_encoding($string,"Windows-1252"); // true
mb_detect_encoding($string,"ISO-8859-1"); // ISO-8859-1
mb_check_encoding($string,"ISO-8859-1"); // true
mb_detect_encoding($string,"UTF-8",true); // false
mb_detect_encoding($string,"UTF-8"); // UTF-8
mb_check_encoding($string,"UTF-8"); // false
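For anyone who wants to reproduce this, here is a minimal sketch that runs the same calls in a loop. It assumes the mbstring extension is loaded; the variable names are just my own choices, and the output may differ between PHP versions, so I have not listed expected results for the mbstring calls.

```php
<?php
// Reproduction sketch; requires the mbstring extension.
$string = "\x65\x92";

// Print the raw bytes so the input is unambiguous.
echo bin2hex($string), PHP_EOL; // 6592

$encodings = ["Windows-1252", "ISO-8859-1", "UTF-8"];

foreach ($encodings as $encoding) {
    // Non-strict detection, strict detection, and validity check
    // for the same string and the same candidate encoding.
    $detected       = mb_detect_encoding($string, $encoding);
    $detectedStrict = mb_detect_encoding($string, $encoding, true);
    $valid          = mb_check_encoding($string, $encoding);

    printf(
        "%-13s detect=%s strict=%s check=%s\n",
        $encoding,
        var_export($detected, true),
        var_export($detectedStrict, true),
        var_export($valid, true)
    );
}
```

Running it on more than one PHP version would show whether the results above are specific to my setup.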
I don't understand why mb_detect_encoding gives "ISO-8859-1" for the string but not "Windows-1252", when, according to https://en.wikipedia.org/wiki/ISO/IEC_8859-1 and https://en.wikipedia.org/wiki/Windows-1252, the byte \x92 is defined in the Windows-1252 character encoding but not in ISO-8859-1.

Secondly, I don't understand how mb_detect_encoding can return false, but mb_check_encoding can return true for the same string and same character encoding.

Finally, I don't understand why the string can ever be detected as UTF-8, strict mode or not. The byte \x92 is a continuation byte in UTF-8, but in this string, it's following a valid character byte, not a leading byte for a sequence.
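
To make the UTF-8 point concrete, here is a small sketch that classifies each byte of the test string by its high bits. The helper name utf8_byte_role is hypothetical (it is not part of mbstring); it only encodes the standard UTF-8 byte ranges and does not validate whole sequences.

```php
<?php
// Hypothetical helper: classify one byte by its role in UTF-8.
function utf8_byte_role(int $byte): string
{
    if ($byte <= 0x7F) {
        return 'ascii';         // 0xxxxxxx: complete single-byte character
    }
    if ($byte <= 0xBF) {
        return 'continuation';  // 10xxxxxx: only valid after a lead byte
    }
    if ($byte >= 0xC2 && $byte <= 0xF4) {
        return 'lead';          // starts a 2- to 4-byte sequence
    }
    return 'invalid';           // 0xC0, 0xC1, 0xF5..0xFF never appear in UTF-8
}

// unpack('C*', ...) yields each byte as an unsigned integer.
foreach (unpack('C*', "\x65\x92") as $byte) {
    printf("0x%02X %s\n", $byte, utf8_byte_role($byte));
}
// prints:
// 0x65 ascii
// 0x92 continuation
```

So the second byte is a continuation byte with no lead byte before it, which is why I would expect the string to be rejected as UTF-8.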