
In the function mb_detect_encoding there is a parameter for strict mode.

In the first, most upvoted comment:

<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false

This is true, yes. But can anybody explain why that is?

Will Vousden
vaso123
  • Ultimately that flag gets passed through to [here](https://github.com/php/php-src/blob/c72282a13b12b7e572469eba7a7ce593d900a8a2/ext/mbstring/libmbfl/mbfl/mbfilter.c#L718); but I'll be damned if I can figure out what it does… – deceze Aug 24 '16 at 07:51
  • FWIW, *yet another* reason to never use this function, because *detecting* encodings is fundamentally impossible to begin with. Very interesting question nonetheless. – deceze Aug 24 '16 at 07:54
  • @deceze Funny: the only comment about `strict` in the entire source code is `/* set strict flag */` – Álvaro González Aug 24 '16 at 09:55
  • @Álvaro Yup, super helpful. *Thanks, guys…* ಠ_ಠ – deceze Aug 24 '16 at 09:56

3 Answers


Everything in this answer is based on my reading of the code here and here.

I did not write it and I did not step through it with a debugger; this is my interpretation only.


It seems that the intention was for strict mode to check if the string as a whole was valid for the encoding, while non-strict mode would allow for a sub-sequence that could be part of a valid string. For example, if the string ended with what should be the first byte of a multi-byte character it would not match in strict mode but would still qualify as UTF-8 under non-strict mode.
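
The "valid as a whole" idea can be illustrated with mb_check_encoding(), a separate mbstring function that validates a complete string. This is only a sketch of the concept, assuming the mbstring extension is loaded; it does not call mb_detect_encoding(), whose non-strict results may differ between PHP versions. The sample strings are made up for illustration.

```php
<?php
// Whole-string validation with mb_check_encoding() (requires ext/mbstring).
$complete  = "caf\xc3\xa9"; // "café" in UTF-8: 0xC3 0xA9 encodes "é"
$truncated = "caf\xc3";     // cut off right after the lead byte of "é"

var_dump(mb_check_encoding($complete, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($truncated, 'UTF-8')); // bool(false)
```

The truncated string is a valid prefix of a UTF-8 string, which is the case non-strict mode appears meant to tolerate, but it is not valid UTF-8 on its own.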

However, there seems to be a bug* where, in non-strict mode, only the first byte of the string is checked in some circumstances.

Example:

The byte 0xf8 is not allowed anywhere in UTF-8. When it is placed at the start of a string, mb_detect_encoding() properly returns false regardless of which mode is used.

$str = "\xf8foo";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

But as long as the first byte of the string is one that may occur in a UTF-8 sequence, non-strict mode returns UTF-8, even when an invalid byte follows later.

$str = "foo\xf8";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

So while your ISO-8859-1 string 'áéóú' is not valid UTF-8, its first byte "\xe1" can occur in UTF-8, and mb_detect_encoding() mistakenly reports the string as UTF-8.
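
To see why "\xe1" passes a first-byte check while the string as a whole does not, each byte can be classified by the UTF-8 bit patterns. The helper utf8_byte_role() below is hypothetical and written only for this illustration; it is not part of mbstring and says nothing about how libmbfl is implemented.

```php
<?php
// Hypothetical helper: classify a single byte by its role in UTF-8.
function utf8_byte_role(int $byte): string
{
    if ($byte <= 0x7F) {
        return 'ascii';          // 0xxxxxxx
    }
    if (($byte & 0xC0) === 0x80) {
        return 'continuation';   // 10xxxxxx
    }
    if ($byte >= 0xC2 && $byte <= 0xDF) {
        return 'lead-2';         // starts a 2-byte sequence
    }
    if (($byte & 0xF0) === 0xE0) {
        return 'lead-3';         // 1110xxxx, starts a 3-byte sequence
    }
    if ($byte >= 0xF0 && $byte <= 0xF4) {
        return 'lead-4';         // starts a 4-byte sequence
    }
    return 'invalid';            // 0xC0, 0xC1, 0xF5..0xFF never occur
}

// The four bytes of 'áéóú' in ISO-8859-1, plus 0xF8 from the examples above.
foreach ([0xE1, 0xE9, 0xF3, 0xFA, 0xF8] as $byte) {
    printf("0x%02X => %s\n", $byte, utf8_byte_role($byte));
}
```

0xE1 is a plausible lead byte, but it must be followed by two continuation bytes; here it is followed by 0xE9, another lead byte, so the sequence is invalid from the second byte on.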


*I've opened a report for this at https://bugs.php.net/bug.php?id=72933

user3942918

áéóú in ISO-8859-1 encodes as:

e1 e9 f3 fa

If you misinterpret it as UTF-8, all you get is invalid byte sequences. The Multi-Byte extension is basically designed to ignore errors. For instance, mb_convert_encoding() will replace those sequences with question marks or whatever you set with mb_substitute_character().
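
A small sketch of that substitution behaviour, assuming the mbstring extension is loaded. The substitute character is set explicitly to '?' so the result does not depend on php.ini; the exact number of substitution characters produced is not asserted here because it may vary between PHP versions.

```php
<?php
// 'áéóú' as raw ISO-8859-1 bytes.
$latin1 = "\xe1\xe9\xf3\xfa";

// Use '?' (U+003F) for anything that cannot be converted.
mb_substitute_character(0x3F);

// Misread as UTF-8: the invalid sequences are silently replaced.
$misread = mb_convert_encoding($latin1, 'UTF-8', 'UTF-8');

// Read with the correct source encoding: the characters survive.
$correct = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($misread);
var_dump(bin2hex($correct)); // string(16) "c3a1c3a9c3b3c3ba"
```

No error or warning is raised for the misread string; the original information is simply lost.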

My educated guess is that the strict parameter determines what should be done with invalid byte sequences:

  • false means to remove them
  • true means to keep them

If you ignore these invalid sequences you're obviously discarding extremely valuable information and you only get sensible results in very limited circumstances, e.g.

$str = chr(81);
var_dump( mb_detect_encoding($str, ['ISO-8859-1', 'Windows-1252']) );
var_dump( mb_detect_encoding($str, ['Windows-1252', 'ISO-8859-1']) );

To sum up, mb_detect_encoding() is in general not as useful as you may think and it's total crap with the default parameters.

Álvaro González

Because $str is not actually UTF-8 but ISO-8859-1. In non-strict mode, UTF-8 may be treated the same as ISO-8859-1, but in strict mode only actual UTF-8 passes the UTF-8 comparison (explained here).

Community
Justinas
  • Those specific characters look very different in UTF-8 and 8859. They're most certainly *not* the same and cannot be "treated the same". This is only true for the first 128 characters (ASCII), which these do not fall into. That string is plain invalid in UTF-8, period. – deceze Aug 24 '16 at 07:52