Filename contains an umlaut (ä, ö, ü) and, therefore, the filename seems to be different

Question

When I load "Österreich" from the Database, it does not match my filename "Österreich". That is the problem.

I have a file called "Österreich.php" which I want to read from its directory. When I call `strlen()` on "Österreich" (without the ".php"), it returns 12, but it should be 10. This causes problems because I want to use the name to load data from the database, and for whatever reason it seems to be a "different" word.

Any ideas?

`strlen("Österreich");` returns 11. So the umlaut counts as a character or needs 2 bytes to represent that character O with an umlaut. Use `mb_strlen`. — AbraCadaver, Mar 11 '21 at 00:03
Is it possible that the data is different when loaded from a directory with `readdir()`? — frank, Mar 11 '21 at 00:04
I don't know, you need to show code that would replicate the `strlen` of 12. But `mb_strlen` should work when working with multibyte stuff. — AbraCadaver, Mar 11 '21 at 00:06
Well, even if it is 11 (as in your example) that causes problems. Österreich clearly has 10 characters. Why is it so complicated here? — frank, Mar 11 '21 at 00:08
yes, but I don't need the length of the string. When I load "Österreich" from the Database, it does not match my filename "Österreich". That is the problem. — frank, Mar 11 '21 at 00:11
Or is it not possible to "store" the word "Österreich" so it is treated as a normal string and not multibyte string? — frank, Mar 11 '21 at 00:15
I deal in English only and only have a little familiarity with multibyte so I'll wait for someone else to help. — AbraCadaver, Mar 11 '21 at 00:16
there's no such thing as a "multibyte string". A string is a series of bytes and an associated character set that's used to translate those bytes into a visual representation. Some character sets use multiple bytes to represent a single visual glyph, e.g. UTF-8. — Sammitch, Mar 11 '21 at 00:17
okay, so when I have "Österreich" in the Database and "Österreich" as a file name and they don't match (`"Österreich" == "Österreich"` returns false in my case), what can I do? Are my PHP script and the database configured differently? — frank, Mar 11 '21 at 00:18
The `Ö` in DB might be different from the `Ö` on your filesystem. Here's a similar question I'd asked in the past: https://unix.stackexchange.com/questions/494883/utf8-character-makes-file-inaccessible — user3783243, Mar 11 '21 at 00:55
Well that explains it. Mac's HFS UTF8 handling is moronic. https://stackoverflow.com/a/6153713/1064767 They basically _require_ the 'non-ideal' form for filenames, and de-normalizing a string like that is a pain in PHP. I would VERY strongly suggest running in a VM for this, and various other reasons. — Sammitch, Mar 11 '21 at 01:13

Sammitch · Answer 1 · 2021-03-11T01:23:25.880

The hint is in the length: the Ö should ideally be a single two-byte UTF-8 sequence, which makes the byte length of the string 11, not 12.

The only way I can think of for Österreich to occupy 12 bytes is if it's stored in the non-ideal-but-still-valid form of a regular O followed by a separate combining umlaut mark (U+0308), e.g. `O\u{0308}sterreich`.
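To make the byte difference visible, here is a minimal sketch that dumps both spellings with `strlen()` and `bin2hex()`. The variable names are made up for illustration, and it assumes PHP 7 or later for the `\u{...}` escapes. It doesn't need the intl extension.

```php
<?php
// Two visually identical spellings of the same word.
$precomposed = "\u{00D6}sterreich";   // Ö as a single code point, U+00D6
$decomposed  = "O\u{0308}sterreich";  // plain O followed by combining mark U+0308

// strlen() counts bytes, not characters.
echo strlen($precomposed), "\n"; // 11
echo strlen($decomposed), "\n";  // 12

// Hex dumps of the leading bytes show where the two strings diverge.
echo bin2hex(substr($precomposed, 0, 2)), "\n"; // c396
echo bin2hex(substr($decomposed, 0, 3)), "\n";  // 4fcc88

// Different bytes, so a plain comparison fails.
var_dump($precomposed === $decomposed); // bool(false)
```

Running the same `bin2hex()` check on the value from the database and on the name returned by the filesystem should show which side holds which form.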

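// De-normalize: replace each multibyte character with its raw decomposition
// (e.g. "Ö" becomes "O" + U+0308). Needs the intl extension and PHP >= 7.3.
function utf8_denormalize($string) {
    return implode('',
        array_map(
            function($c){
                if(strlen($c) > 1){
                    // getRawDecomposition() returns null when the character
                    // has no decomposition, so keep the original in that case.
                    return Normalizer::getRawDecomposition($c) ?? $c;
                }
                return $c;
            },
            // Split into UTF-8 characters rather than single bytes.
            preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY)
        )
    );
}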

$str1 = "Österreich";
$str2 = "O\u{0308}sterreich";
$str3 = Normalizer::normalize($str2);
$str4 = utf8_denormalize($str1);

var_dump(
    $str1,
    $str2,
    $str3,
    $str4,
    $str1 === $str3,
    $str2 === $str4
);

Output:

string(11) "Österreich"
string(12) "Österreich"
string(11) "Österreich"
string(12) "Österreich"
bool(true)
bool(true)

I would say that the data on both sides of this issue should be inspected and/or normalized, but you should also be careful as you may have "duplicate" filenames in your database and/or filesystem comprised of the normalized and un-normalized forms of various strings.
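One way to do the "normalize both sides" part is to bring both strings to the same form just before comparing them. The sketch below is only an illustration: `names_match()` is a hypothetical helper, the two sample values stand in for a database value and a `readdir()` result, and it assumes the intl extension provides `Normalizer`. If the class is missing, it falls back to a plain byte comparison.

```php
<?php
// Hypothetical helper: compare two names after converting both to NFC.
function names_match(string $a, string $b): bool
{
    if (!class_exists('Normalizer')) {
        // intl is not loaded: fall back to a plain byte comparison.
        return $a === $b;
    }

    $na = Normalizer::normalize($a, Normalizer::FORM_C);
    $nb = Normalizer::normalize($b, Normalizer::FORM_C);

    // normalize() returns false on invalid input; treat that as "no match".
    if ($na === false || $nb === false) {
        return false;
    }

    return $na === $nb;
}

$fromDatabase   = "\u{00D6}sterreich";  // stand-in for the database value
$fromFilesystem = "O\u{0308}sterreich"; // stand-in for a readdir() result

var_dump(names_match($fromDatabase, $fromFilesystem));
```

With intl loaded, the two sample values compare as equal. Without it, the fallback still reports them as different, which is the original problem.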

https://www.php.net/manual/en/normalizer.normalize.php

Edit

Mac HFS is dumb and requires the de-normalized form for filenames. I've cobbled together a de-normalizer (YMMV), but honestly, unless your production environment is a Mac, you should be testing your code against a VM that matches your production environment as closely as possible. Filesystem peculiarities are just one of many edge cases that will throw wrenches into the works.

Thank you! But it says `"Class 'Normalizer' not found"`. I am currently trying to find a solution — frank, Mar 11 '21 at 00:39
