1

I'm working with internationalized filenames in my C program. In particular, there's a piece of code where I create a file whose name contains a Chinese character:

int fd = open("/tmp/⺴", O_WRONLY | O_CREAT | O_TRUNC, 0644); /* O_CREAT requires a mode argument */

This call works and the file is created, even though my system locale is Russian (LANG=ru_RU.UTF-8).

Why is the file created when my locale does not seem to support the codes of Chinese characters? And in that case, what does the system locale actually influence?

Jonathan Leffler
Sas
  • 3
    Your code set is UTF-8; that supports Chinese characters as well as Cyrillic, and Hindi, Arabic, Hebrew, Greek, and so on — essentially all human languages, though there are probably a few esoteric ones still not defined in Unicode (Klingon isn't represented). The underlying file system probably supports UTF-8 names (any characters in UTF-8). What the Russian part of the locale does is ensure that the Cyrillic characters are sorted appropriately — in the manner that Russians expect — but it provides some ordering (possibly, but not necessarily, "code set order") for non-Cyrillic characters. – Jonathan Leffler Jun 25 '19 at 08:28
  • 1
    @JonathanLeffler do you have any reference for linux filesystem "supporting" utf-8? afaik on linux, paths are just null-terminated byte strings with no specific encoding. They could be utf-8 just as they could be anything else. If you try to use ebcdic filenames, then linux will treat `a` as the path separator ;-). –  Jun 25 '19 at 09:53
  • 1
    @mosvy you're correct. Linux kernel just considers a filename as a string of bytes, and so does the `open` syscall. – Antti Haapala -- Слава Україні Jun 25 '19 at 10:05
  • Actually, it turns out that the kernel must do a little bit of work to support UTF-8 filenames properly. Specifically, it needs to normalize combining forms. You're not allowed to (and you wouldn't want to) be able to have distinct files `schön` and `schön`. So if you create a filename with the former spelling, the kernel is supposed to quietly convert it to the latter. I know MacOS does this, and I would have thought Linux does, although mosvy's answer suggests otherwise. – Steve Summit Jun 25 '19 at 11:37
  • (What's the difference between `schön` and `schön`? You can't see it without special tools, but one has `o` followed by a combining umlaut, while the other has a precomposed `ö` character. [Well, they were when I entered them, but it looks like Stackoverflow and/or my browser may have normalized them, too.]) – Steve Summit Jun 25 '19 at 11:42
  • @SteveSummit linux doesn't do that. see the hó/hó example from my answer. As to MacOS doing that, that was a stupid idea, but discussing it here is probably off-topic. Linux does, however, do a bit of encoding conversion in order to support ntfs/cifs and other microsoft filesystems, though that's not tied to the locales mechanism. –  Jun 25 '19 at 11:46
  • @mosvy: I don't have a direct reference for it, but: (1) if the file system is not aware of UTF-8, it will almost certainly allow any sequence of bytes in a file name except for bytes `'/'` and `'\0'`, so it can accept UTF-8 names without problem (but will also accept names that are not valid when viewed as UTF-8) — as long as the name is short enough that no truncation occurs; and (2) if the file system is aware of UTF-8, then of course it supports UTF-8 directly, and may deal with normalization, etc. A file system that's not aware of UTF-8 won't prevent homonyms, of course. – Jonathan Leffler Jun 25 '19 at 19:20
  • 1
    See also [What characters are forbidden in Windows and Linux directory names?](https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names) – Jonathan Leffler Jun 25 '19 at 19:25

1 Answer

3

The open(2) function is just a thin wrapper around the open system call: it does nothing more than put the arguments in the right registers, perform the system call, and retrieve its return value.
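As a rough illustration of how thin that wrapper is, the sketch below skips `open` and issues the system call itself through `syscall(2)`. It assumes Linux; `raw_create` is a hypothetical helper name, and `SYS_openat` is used instead of `SYS_open` because the latter does not exist on some architectures such as aarch64.

```c
#define _GNU_SOURCE            /* for syscall() */
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create (or truncate) `path` by invoking the system call directly.
 * Returns a file descriptor, or -1 with errno set.
 * Assumption: Linux. The path is handed over as a plain byte string;
 * nothing here looks at the locale or the encoding. */
int raw_create(const char *path)
{
    return (int)syscall(SYS_openat, AT_FDCWD, path,
                        O_WRONLY | O_CREAT | O_TRUNC, 0644);
}
```

Calling `raw_create("/tmp/⺴")` should behave the same way as the `open` call in the question, since the same bytes end up in the same system call.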

And the kernel doesn't know or care about locales at all.
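To see this from user space, here is a small sketch (the helper name `create_under_locale` is made up for this example) that changes the process locale before calling `open`. If the statement above holds, the result should not depend on which locale is active, because the path bytes reach the kernel unchanged.

```c
#include <fcntl.h>
#include <locale.h>
#include <unistd.h>

/* Switch the process to `locale_name` (if the system has it), then
 * create `path`. Returns 0 if the file could be created, -1 otherwise. */
int create_under_locale(const char *locale_name, const char *path)
{
    /* setlocale returns NULL when the locale is not installed; the
     * process then simply keeps its previous locale. */
    (void)setlocale(LC_ALL, locale_name);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}
```

For instance, `create_under_locale("C", "/tmp/⺴")` and `create_under_locale("ru_RU.UTF-8", "/tmp/⺴")` would be expected to succeed or fail together. What the locale does affect lives in the C library: collation, character classification, and multibyte conversion.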

Specifically, in the path argument of open(2) the only bytes which have special significance are 47 (/) which separates path elements and 0 (the NUL byte) which terminates it.

Neither the kernel nor most filesystems will prevent you from creating files with names which are malformed UTF-8 or any binary garbage -- for the kernel they're just bytes.
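The byte rule above can be sketched in C. `is_plain_component` and `create_non_utf8_name` are hypothetical helpers written for this answer, not part of any library. The first checks only for the two special bytes; the second tries to create a file whose name is the byte pair FF FE, which can never occur in valid UTF-8. It assumes a filesystem that does not validate names itself (true of most Linux filesystems, but not guaranteed everywhere).

```c
#define _POSIX_C_SOURCE 200809L  /* for mkdtemp */
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* A byte string can serve as one path component if it is non-empty and
 * contains neither '/' (47) nor NUL (0). The encoding is never inspected.
 * Length limits such as NAME_MAX are ignored in this sketch. */
bool is_plain_component(const unsigned char *bytes, size_t len)
{
    if (len == 0)
        return false;
    for (size_t i = 0; i < len; i++)
        if (bytes[i] == '/' || bytes[i] == '\0')
            return false;
    return true;
}

/* Create a file named by the bytes FF FE inside a fresh scratch
 * directory, then clean up. Returns 0 if creation succeeded. */
int create_non_utf8_name(void)
{
    char dir[] = "/tmp/bytes_demo_XXXXXX";
    if (mkdtemp(dir) == NULL)
        return -1;

    char path[64];
    snprintf(path, sizeof path, "%s/%s", dir, "\xff\xfe");

    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    int ok = (fd >= 0) ? 0 : -1;
    if (fd >= 0) {
        close(fd);
        unlink(path);
    }
    rmdir(dir);
    return ok;
}
```

This is also why UTF-8 names work without any kernel support: every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never collide with `/` or NUL.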

Also, the kernel doesn't do any Unicode normalization or handling of confusables:

$ echo > ∕еtс∕раsswd; touch hó hó
$ ls
hó  hó  ∕еtс∕раsswd
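The hó/hó part of that experiment can be reproduced from C with the two spellings written out as explicit bytes. This is only a sketch: `count_ho_variants` is a hypothetical helper that creates both names in a scratch directory under `/tmp` and counts the resulting entries. On a filesystem that does no normalization I would expect it to return 2; on one that normalizes names it would return 1.

```c
#define _POSIX_C_SOURCE 200809L  /* for mkdtemp */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Two spellings of "hó" that render identically. */
static const char NFC_NAME[] = "h\xc3\xb3";   /* precomposed U+00F3 */
static const char NFD_NAME[] = "ho\xcc\x81";  /* 'o' + combining acute U+0301 */

/* Build "<dir>/<name>" into out. */
static void join(char *out, size_t n, const char *dir, const char *name)
{
    snprintf(out, n, "%s/%s", dir, name);
}

/* Create an empty file `name` inside `dir`; returns 0 on success. */
static int touch_in(const char *dir, const char *name)
{
    char path[128];
    join(path, sizeof path, dir, name);
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}

/* Remove `name` from `dir`, ignoring errors. */
static void remove_in(const char *dir, const char *name)
{
    char path[128];
    join(path, sizeof path, dir, name);
    unlink(path);
}

/* Create both spellings in a fresh scratch directory and return the
 * number of entries found there (excluding "." and ".."), or -1 on
 * error. The scratch directory is removed before returning. */
int count_ho_variants(void)
{
    char dir[] = "/tmp/norm_demo_XXXXXX";
    if (mkdtemp(dir) == NULL)
        return -1;

    int count = -1;
    if (touch_in(dir, NFC_NAME) == 0 && touch_in(dir, NFD_NAME) == 0) {
        DIR *d = opendir(dir);
        if (d != NULL) {
            struct dirent *e;
            count = 0;
            while ((e = readdir(d)) != NULL)
                if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0)
                    count++;
            closedir(d);
        }
    }
    remove_in(dir, NFC_NAME);
    remove_in(dir, NFD_NAME);
    rmdir(dir);
    return count;
}
```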
  • I think the kernel is *supposed* to normalize decomposed versus precomposed forms, to avoid rank confusion and security holes. MacOS does this; I didn't realize Linux didn't. (But it's an interesting tradeoff between "just treat it as a string of bytes no matter what" and "satisfy various other goals".) – Steve Summit Jun 25 '19 at 11:45