Everything in this answer is based on my reading of the code here and here.
I did not write it and did not step through it with a debugger; this is my interpretation only.
It seems that the intention was for strict mode to check if the string as a whole was valid for the encoding, while non-strict mode would allow for a sub-sequence that could be part of a valid string. For example, if the string ended with what should be the first byte of a multi-byte character it would not match in strict mode but would still qualify as UTF-8 under non-strict mode.
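To make that distinction concrete, here is a rough sketch of the three cases. `utf8_status()` is a hypothetical helper of my own, not part of mbstring and not how the extension implements either mode. The tail pattern is only approximate: it does not reject overlong or surrogate prefixes.

```php
<?php
// Hypothetical helper (not an mbstring function): classifies a byte string as
// 'complete' (valid UTF-8 as a whole), 'truncated' (valid except for a
// multi-byte character cut off at the very end), or 'invalid'.
function utf8_status(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'complete';
    }

    // A lead byte followed by fewer continuation bytes than it requires.
    $truncatedTail = '/^(?:[\xC2-\xDF]|[\xE0-\xEF][\x80-\xBF]?|[\xF0-\xF4][\x80-\xBF]{0,2})$/';

    // An incomplete character can be at most 3 bytes long.
    for ($n = 1; $n <= min(3, strlen($bytes)); $n++) {
        $head = substr($bytes, 0, -$n);
        $tail = substr($bytes, -$n);
        if (mb_check_encoding($head, 'UTF-8') && preg_match($truncatedTail, $tail)) {
            return 'truncated';
        }
    }

    return 'invalid';
}

var_dump(
    utf8_status("foo"),     // string(8) "complete"
    utf8_status("foo\xc3"), // string(9) "truncated"
    utf8_status("foo\xf8")  // string(7) "invalid"
);
```

As I read the intent, strict mode should accept only the 'complete' case, while non-strict mode should also accept 'truncated'. Neither should accept 'invalid'.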
However, there seems to be a bug* where, in some circumstances, non-strict mode only checks the first byte of the string.
Example:
The byte 0xf8 is not allowed anywhere in UTF-8. When placed at the start of a string, mb_detect_encoding() properly returns false for it regardless of which mode is used.
$str = "\xf8foo";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);
But as long as the first byte of the string is one that may occur somewhere in a UTF-8 sequence, non-strict mode returns UTF-8, even when an invalid byte follows later.
$str = "foo\xf8";
var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);
So while your ISO-8859-1 string 'áéóú' is not valid UTF-8, its first byte "\xe1" can occur in UTF-8, and mb_detect_encoding() mistakenly reports the string as UTF-8.
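If what you actually need is a yes/no answer on whether the whole string is valid UTF-8, one option that avoids the detection logic altogether is mb_check_encoding(). A minimal sketch, assuming your input really is ISO-8859-1 (the variable names are mine):

```php
<?php
// The ISO-8859-1 bytes for 'áéóú', written as escapes so the result does not
// depend on the encoding this source file is saved in.
$latin1 = "\xe1\xe9\xf3\xfa";

// Validates the string as a whole rather than guessing an encoding.
$isUtf8 = mb_check_encoding($latin1, 'UTF-8');

// Convert explicitly once the source encoding is known.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump(
    $isUtf8,                                // bool(false)
    mb_check_encoding($converted, 'UTF-8'), // bool(true)
    bin2hex($converted)                     // string(16) "c3a1c3a9c3b3c3ba"
);
```

This does not tell you which encoding the string is in, only whether it is valid for the one you name, so it replaces the detection step only when you already know the candidates.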
*I've opened a report for this at https://bugs.php.net/bug.php?id=72933