The hint is that the Ö should ideally be a two-byte UTF-8 sequence, so the byte length of the string would be 11, not 12. The only way I can think of for Österreich to occupy 12 bytes is if it's in the non-ideal-but-still-valid form of a regular O plus a separate combining umlaut mark, e.g. O\u{0308}sterreich:
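function utf8_denormalize($string) {
    return implode('',
        array_map(
            function($c) {
                // Only multi-byte characters can have a decomposition.
                if (strlen($c) > 1) {
                    // getRawDecomposition() returns null when there is none,
                    // so fall back to the original character.
                    return Normalizer::getRawDecomposition($c) ?? $c;
                }
                return $c;
            },
            // Split into UTF-8 characters, skipping the empty edge pieces.
            preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY)
        )
    );
}

$str1 = "Österreich";                  // precomposed Ö (U+00D6)
$str2 = "O\u{0308}sterreich";          // O + combining diaeresis (U+0308)
$str3 = Normalizer::normalize($str2);  // default form is NFC
$str4 = utf8_denormalize($str1);

var_dump(
    $str1,
    $str2,
    $str3,
    $str4,
    $str1 === $str3,
    $str2 === $str4
);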
Output:
string(11) "Österreich"
string(12) "Österreich"
string(11) "Österreich"
string(12) "Österreich"
bool(true)
bool(true)
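To see where the extra byte comes from, it can help to dump the raw bytes. This is only an illustrative sketch; the variable names are mine, and it assumes PCRE was built with UTF-8 support (which the `//u` split above already relies on).

```php
<?php
// Illustrative only: compare the raw bytes of the two spellings.
$composed   = "\u{00D6}sterreich";   // precomposed Ö (U+00D6)
$decomposed = "O\u{0308}sterreich";  // O followed by U+0308 COMBINING DIAERESIS

foreach ([$composed, $decomposed] as $s) {
    printf(
        "%d bytes, %d code points, leading bytes: %s\n",
        strlen($s),                      // byte length
        preg_match_all('/./su', $s),     // number of UTF-8 code points
        bin2hex(substr($s, 0, 3))        // first three bytes as hex
    );
}
// 11 bytes, 10 code points, leading bytes: c39673
// 12 bytes, 11 code points, leading bytes: 4fcc88
```

The precomposed Ö is the two bytes `c3 96`, while the decomposed form is a plain `O` (`4f`) followed by the two-byte combining mark `cc 88`.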
I would say that the data on both sides of this issue should be inspected and/or normalized, but be careful: you may already have "duplicate" filenames in your database and/or filesystem that differ only in being the normalized and un-normalized forms of the same string.
https://www.php.net/manual/en/normalizer.normalize.php
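One way to check for such "duplicates" before normalizing anything is to group the names by their NFC form and look for groups with more than one spelling. A minimal sketch, assuming the intl extension is loaded; `findNormalizationCollisions()` and the sample filenames are hypothetical and not part of any library:

```php
<?php
// Sketch: group names that become identical after NFC normalization.
// findNormalizationCollisions() is a hypothetical helper, not part of intl.
function findNormalizationCollisions(array $names): array
{
    $groups = [];
    foreach ($names as $name) {
        // normalize() returns false on invalid input; keep such names as-is.
        $key = Normalizer::normalize($name, Normalizer::FORM_C);
        if ($key === false) {
            $key = $name;
        }
        $groups[$key][] = $name;
    }
    // Keep only the groups that contain more than one distinct spelling.
    return array_filter($groups, function (array $group) {
        return count(array_unique($group)) > 1;
    });
}

// Made-up sample data: the first two differ only in normalization form.
$names = ["\u{00D6}sterreich.txt", "O\u{0308}sterreich.txt", "readme.txt"];
$collisions = findNormalizationCollisions($names);
var_dump(count($collisions)); // int(1)
```

Running something like this against the database column and the directory listing separately should show whether normalizing in place would make two entries collide.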
Edit
Mac HFS+ is dumb and requires the de-normalized (decomposed) form for filenames. I've cobbled together a de-normalizer, utf8_denormalize() above (YMMV), but honestly, unless your production environment is a Mac, you should be testing your code against a VM that matches production as closely as possible. Filesystem peculiarities are just one of many edge cases that will throw wrenches into the works.
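If the hand-rolled de-normalizer is more than you need, intl can also produce the canonical decomposition directly via `Normalizer::FORM_D`. The sketch below assumes the intl extension is available; `sameName()` is a hypothetical helper. As far as I know HFS+ uses its own variant of NFD, so this is not guaranteed to match what the filesystem stores byte-for-byte for every character.

```php
<?php
// Sketch: canonical decomposition via intl instead of a per-character loop.
$nfc = "\u{00D6}sterreich";
$nfd = Normalizer::normalize($nfc, Normalizer::FORM_D);

var_dump(strlen($nfd));                   // int(12)
var_dump($nfd === "O\u{0308}sterreich");  // bool(true)

// Hypothetical helper: compare two names regardless of normalization form
// by bringing both sides to NFC first.
function sameName(string $a, string $b): bool
{
    $na = Normalizer::normalize($a, Normalizer::FORM_C);
    $nb = Normalizer::normalize($b, Normalizer::FORM_C);
    // normalize() returns false for invalid input; never treat that as a match.
    return $na !== false && $na === $nb;
}

var_dump(sameName($nfc, $nfd)); // bool(true)
```

Comparing through a helper like this avoids having to decide which form the filesystem or database happens to use.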