
I'm trying to understand the logic of the two functions mb_detect_encoding and mb_check_encoding, but the documentation is poor. I'm starting with a very simple test string:

$string = "\x65\x92";

This is a lowercase 'e' (byte x65) followed by a right curly quote mark (byte x92) when using Windows-1252 encoding.
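As a quick sanity check of what those two bytes represent, here is a minimal sketch that converts the string from Windows-1252 to UTF-8 with mb_convert_encoding and prints the resulting bytes as hex. The variable names are only illustrative, and the mbstring extension is assumed to be loaded.

```php
<?php
// The two raw bytes from the test string.
$string = "\x65\x92";

// Interpret the bytes as Windows-1252 and re-encode them as UTF-8.
$utf8 = mb_convert_encoding($string, 'UTF-8', 'Windows-1252');

// 0x65 ('e') stays a single byte; 0x92 maps to U+2019 (right single
// quotation mark), which is the three-byte sequence E2 80 99 in UTF-8.
echo bin2hex($utf8), "\n"; // 65e28099
```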

I get the following results:

mb_detect_encoding($string,"Windows-1252"); // false
mb_check_encoding($string,"Windows-1252"); // true
mb_detect_encoding($string,"ISO-8859-1"); // ISO-8859-1
mb_check_encoding($string,"ISO-8859-1"); // true
mb_detect_encoding($string,"UTF-8",true); // false
mb_detect_encoding($string,"UTF-8"); // UTF-8
mb_check_encoding($string,"UTF-8"); // false
  • I don't understand why mb_detect_encoding gives "ISO-8859-1" for the string but not "Windows-1252", when, according to https://en.wikipedia.org/wiki/ISO/IEC_8859-1 and https://en.wikipedia.org/wiki/Windows-1252, the byte x92 is defined in the Windows-1252 character encoding but not in ISO-8859-1.

  • Secondly, I don't understand how mb_detect_encoding can return false, but mb_check_encoding can return true for the same string and same character encoding.

  • Finally, I don't understand why the string can ever be detected as UTF-8, strict mode or not. The byte x92 is a continuation byte in UTF-8, but in this string, it's following a valid character byte, not a leading byte for a sequence.

Dom
  • It's quite interesting to stumble across this question. I'm the author of the new implementation of these functions in PHP 8.0/8.1. I think you will find they behave more consistently now. If you still have any questions, ask me any time. – Alex D Mar 30 '22 at 20:36
  • This is really helpful since I've been running into a weird issue with this too. Given this: mb_detect_encoding("m2", "ASCII,JIS,UTF-8,UTF-16,UTF-32,EUC-JP,SJIS,ISO-8859-1") it returns UTF-16, while everything else I've tested ("m1", "m3", "ma") returns ASCII. I think I'll be trying mb_check_encoding instead – Daniel Mar 01 '23 at 22:53

1 Answer


Your examples do a good job of showing why mb_detect_encoding should be used sparingly: its behavior is unintuitive and sometimes logically wrong. If it must be used, always pass true as the third parameter (strict) so that non-UTF-8 strings don't get reported as UTF-8.
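A minimal sketch of that advice, assuming the mbstring extension is loaded. The helper name detectStrict is hypothetical, not part of mbstring; it only wraps mb_detect_encoding with strict mode switched on and turns the false return value into null.

```php
<?php
// Hypothetical helper: strict detection that returns null instead of false.
function detectStrict(string $bytes, array $candidates): ?string
{
    // Third argument true = strict mode.
    $result = mb_detect_encoding($bytes, $candidates, true);
    return $result === false ? null : $result;
}

// The question's string is not valid UTF-8, so strict detection fails,
// matching the false result reported in the question.
var_dump(detectStrict("\x65\x92", ['UTF-8']));

// Plain ASCII text is valid UTF-8, so this one is detected.
var_dump(detectStrict("plain ASCII text", ['UTF-8']));
```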

It's a bit more reliable to run mb_check_encoding over an array of desired encodings, in order of likelihood/priority. For example:

$encodings = [
    'UTF-8',
    'Windows-1252',
    'SJIS',
    'ISO-8859-1',
];

$string = 'foo';
$detected = null; // stays null if no candidate validates
foreach ($encodings as $candidate) {
    if (mb_check_encoding($string, $candidate)) {
        // We'll assume the encoding is $candidate since it's valid
        $detected = $candidate;
        break;
    }
}

The ordering depends on your priorities though.
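To illustrate how much the ordering matters, here is a small sketch using the two-byte string from the question. The function name firstValidEncoding is hypothetical, and the results rely on the behavior the question reports: mb_check_encoding returns false for UTF-8 and true for both Windows-1252 and ISO-8859-1.

```php
<?php
// Hypothetical helper: return the first candidate that validates, or null.
function firstValidEncoding(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $candidate) {
        if (mb_check_encoding($bytes, $candidate)) {
            return $candidate;
        }
    }
    return null; // no candidate accepted the bytes
}

$bytes = "\x65\x92";

// Same string, same candidates, different priority order.
$a = firstValidEncoding($bytes, ['UTF-8', 'Windows-1252', 'ISO-8859-1']);
$b = firstValidEncoding($bytes, ['UTF-8', 'ISO-8859-1', 'Windows-1252']);

echo $a, "\n"; // Windows-1252
echo $b, "\n"; // ISO-8859-1
```

Since both single-byte encodings accept these bytes, whichever comes first wins. Permissive encodings such as ISO-8859-1 are probably best placed at the end of the list, after stricter ones like UTF-8.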

Michael Butler