
When I load "Österreich" from the Database, it does not match my filename "Österreich". That is the problem.

I have a file called "Österreich.php" which I want to read from its directory. When I call strlen() on "Österreich" (without ".php"), it returns 12, but it should be 10. This causes problems because I want to use the name to load data from the database, and for whatever reason it seems to be a "different" word.

Any ideas?

AbraCadaver
frank
  • `strlen("Österreich");` returns 11. So the umlaut counts as a character or needs 2 bytes to represent that character O with an umlaut. Use `mb_strlen`. – AbraCadaver Mar 11 '21 at 00:03
  • Is it possible that the data is different when loaded from a directory with `readdir()`? – frank Mar 11 '21 at 00:04
  • I don't know, you need to show code that would replicate the `strlen` of 12. But `mb_strlen` should work when working with multibyte stuff. – AbraCadaver Mar 11 '21 at 00:06
  • Well, even if it is 11 (as in your example) that causes problems. Österreich clearly has 10 characters. Why is it so complicated here? – frank Mar 11 '21 at 00:08
  • `mb_strlen` returns 10. – AbraCadaver Mar 11 '21 at 00:09
  • yes, but I don't need the length of the string. When I load "Österreich" from the Database, it does not match my filename "Österreich". That is the problem. – frank Mar 11 '21 at 00:11
  • Or is it not possible to "store" the word "Österreich" so it is treated as a normal string and not multibyte string? – frank Mar 11 '21 at 00:15
  • I deal in English only and only have a little familiarity with multibyte so I'll wait for someone else to help. – AbraCadaver Mar 11 '21 at 00:16
  • There's no such thing as a "multibyte string". A string is a series of bytes and an associated character set that's used to translate those bytes into a visual representation. Some character sets use multiple bytes to represent a single visual glyph, e.g. UTF-8. – Sammitch Mar 11 '21 at 00:17
  • Okay, so when I have "Österreich" in the database and "Österreich" as a file name and they don't match (`"Österreich" == "Österreich"` returns false in my case), what can I do? Are my PHP script and the database configured differently? – frank Mar 11 '21 at 00:18
  • The `Ö` in the DB might be different from the `Ö` on your filesystem. Here's a similar question I'd asked in the past: https://unix.stackexchange.com/questions/494883/utf8-character-makes-file-inaccessible. – user3783243 Mar 11 '21 at 00:55
  • I am on a mac!! – frank Mar 11 '21 at 00:59
  • Well that explains it. Mac's HFS UTF8 handling is moronic. https://stackoverflow.com/a/6153713/1064767 They basically _require_ the 'non-ideal' form for filenames, and de-normalizing a string like that is a pain in PHP. I would VERY strongly suggest running in a VM for this, and various other reasons. – Sammitch Mar 11 '21 at 01:13

1 Answer


The hint is the byte count: the Ö should ideally be a single two-byte UTF-8 sequence, which would make the byte length of the string 11, not 12.

The only way I can think of for Österreich to occupy 12 bytes is the non-ideal-but-still-valid form of a regular O followed by a separate combining umlaut mark (U+0308), e.g. `O\u{0308}sterreich`.

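// Decompose each multi-byte character into base letter + combining mark(s).
// Requires the intl extension; getRawDecomposition() needs PHP 7.3+.
function utf8_denormalize($string) {
    return implode('',
        array_map(
            function($c){
                if(strlen($c) > 1){
                    // getRawDecomposition() returns null when the character
                    // has no decomposition (e.g. ß), so fall back to $c.
                    return Normalizer::getRawDecomposition($c) ?? $c;
                }
                return $c;
            },
            // Split into code points, skipping the empty leading/trailing pieces.
            preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY)
        )
    );
}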

$str1 = "Österreich";
$str2 = "O\u{0308}sterreich";
$str3 = Normalizer::normalize($str2);
$str4 = utf8_denormalize($str1);

var_dump(
    $str1,
    $str2,
    $str3,
    $str4,
    $str1 === $str3,
    $str2 === $str4
);

Output:

string(11) "Österreich"
string(12) "Österreich"
string(11) "Österreich"
string(12) "Österreich"
bool(true)
bool(true)
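Because both forms render identically, `var_dump` alone does not show which one a given value is in. A minimal sketch for inspecting the raw bytes follows; `describe_bytes` is a hypothetical helper name, not something from the question, and it only relies on `strlen`, `bin2hex` and `str_split`.

```php
<?php
// Hypothetical helper: report a string's byte length and its bytes in hex.
function describe_bytes(string $s): string
{
    return sprintf('%d bytes: %s', strlen($s), implode(' ', str_split(bin2hex($s), 2)));
}

$composed   = "\u{00D6}sterreich";  // precomposed Ö (U+00D6)
$decomposed = "O\u{0308}sterreich"; // plain O + combining diaeresis (U+0308)

echo describe_bytes($composed), PHP_EOL;   // starts with c3 96
echo describe_bytes($decomposed), PHP_EOL; // starts with 4f cc 88
```

Running this on the value read from the database and on the name returned by `readdir()` should show whether one starts with `c3 96` and the other with `4f cc 88`.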

I would say that the data on both sides of this issue should be inspected and/or normalized, but you should also be careful as you may have "duplicate" filenames in your database and/or filesystem comprised of the normalized and un-normalized forms of various strings.

https://www.php.net/manual/en/normalizer.normalize.php
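As a rough sketch of the "normalize both sides" idea, assuming the intl extension is loaded: bring both values to the same form (NFC here) before comparing them. `same_text` is a hypothetical helper name, and the two variables only stand in for the database value and the filename.

```php
<?php
// Hypothetical helper: compare two strings after normalizing both to NFC.
// Requires the intl extension (Normalizer class).
function same_text(string $a, string $b): bool
{
    $na = Normalizer::normalize($a, Normalizer::FORM_C);
    $nb = Normalizer::normalize($b, Normalizer::FORM_C);
    // normalize() returns false on invalid input; treat that as "no match".
    if ($na === false || $nb === false) {
        return false;
    }
    return $na === $nb;
}

$fromDatabase = "\u{00D6}sterreich";  // assumed: composed form in the DB
$fromReaddir  = "O\u{0308}sterreich"; // assumed: decomposed form on disk

var_dump($fromDatabase === $fromReaddir);          // bool(false)
var_dump(same_text($fromDatabase, $fromReaddir));  // bool(true)
```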

Edit

Mac HFS is dumb and requires the de-normalized form for filenames. I've cobbled together a de-normalizer (YMMV), but honestly, unless your production environment is a Mac, you should be testing your code against a VM that matches your production environment as closely as possible. Filesystem peculiarities are just one of many edge cases that will throw wrenches into the works.
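A different approach from the de-normalizer, sketched here under the assumption that the intl extension is available: rather than guessing which form the filesystem wants, list the directory and compare NFC-normalized names. `find_file_normalized` is a hypothetical helper, and the directory and filename in the usage lines are placeholders.

```php
<?php
// Hypothetical helper: return the on-disk name of the entry in $dir whose
// NFC-normalized name equals the NFC-normalized $wanted, or null if none.
function find_file_normalized(string $dir, string $wanted): ?string
{
    $target  = Normalizer::normalize($wanted, Normalizer::FORM_C);
    $entries = scandir($dir);
    if ($target === false || $entries === false) {
        return null;
    }
    foreach ($entries as $entry) {
        if (Normalizer::normalize($entry, Normalizer::FORM_C) === $target) {
            return $entry; // the exact bytes the filesystem reports
        }
    }
    return null;
}

// Usage: the database value is composed; the file on disk may not be.
$name = find_file_normalized(__DIR__, "\u{00D6}sterreich.php");
var_dump($name); // on-disk name if a match exists, otherwise NULL
```

The returned name is whatever the filesystem reported, so it can be passed straight to `include` or `file_get_contents` without further conversion.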

Sammitch