How do POSIX systems support Unicode?

Question

I have seen many APIs on POSIX systems (for example Linux, Mac, and Android) that accept a const char* as the argument for a file path.

One example is dlopen: as the documentation shows, its first argument is a const char*. Does it support Unicode file paths, for example a path containing Chinese characters?

FYI: There are multiple Unicode encodings: UTF-8, which is favored for interchange and on Unix systems for everything; UTF-16, which Windows uses due to backwards-compatibility constraints; and UTF-32, which can represent every Unicode code point as a single unit of constant length. All others are only interesting in niche applications. – Deduplicator, Apr 13 '14 at 03:47
@Deduplicator, I want to wrap a cross-platform Unicode API. I am not concerned about performance or memory usage, because the difference will be small. So for productivity and ease of use, any thoughts on the pros and cons of UTF-8 versus UTF-32? – ZijingWu, Apr 13 '14 at 03:57
Go for UTF-8 everywhere; that way has the fewest gotchas. Even on systems that have a UTF-16 API, everything else is done in UTF-8, so you are forced to convert there anyway. Just be sure to always use the native Unicode API, perhaps with a specialised widen/narrow converter. http://utf8everywhere.org – Deduplicator, Apr 13 '14 at 14:52

score 3 · Answer 1 · edited May 23 '17 at 11:46


POSIX systems are not required to support Unicode filenames (see https://stackoverflow.com/a/2306003/481267). However, provided that the names are encoded in UTF-8, there are no technical obstacles to supporting Unicode. Many modern file systems allow any byte in a file name except \0 and /.

The POSIX API deals with null-terminated byte sequences, and when a string is encoded in UTF-8, no code point other than NUL (U+0000) itself has a representation that contains a zero byte. Furthermore, all characters outside the ASCII range (0x00-0x7f) are encoded entirely using bytes with the high-order bit set (0x80-0xff), so there is no chance that the system will mistake part of a multi-byte character for a directory separator.
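
The byte-level argument above can be checked with a small sketch. The helper names `all_bytes_high` and `count_separators`, and the sample path under /tmp, are made up for illustration; the only thing taken as given is that "中文" encodes in UTF-8 as the six bytes E4 B8 AD E6 96 87.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* "中文" spelled out as UTF-8 bytes so the source file's own encoding
   does not matter. */
const char ZHONGWEN_UTF8[] = "\xe4\xb8\xad\xe6\x96\x87";

/* Hypothetical path used only for illustration. */
const char SAMPLE_PATH[] = "/tmp/\xe4\xb8\xad\xe6\x96\x87.so";

/* Returns true if every byte before the terminating NUL has its
   high-order bit set (0x80-0xff), i.e. none of them is an ASCII byte. */
bool all_bytes_high(const char *s) {
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p) {
        if (*p < 0x80) {
            return false;
        }
    }
    return true;
}

/* Counts '/' bytes, which is all a byte-oriented path parser looks at
   when it splits a path into components. */
size_t count_separators(const char *path) {
    size_t n = 0;
    for (; *path != '\0'; ++path) {
        if (*path == '/') {
            ++n;
        }
    }
    return n;
}
```

Calling `all_bytes_high(ZHONGWEN_UTF8)` returns true, and `count_separators(SAMPLE_PATH)` only counts the two ASCII slashes, so the multi-byte characters cannot be split by the path parser or cut short by a stray zero byte.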

edited May 23 '17 at 11:46 by Community

answered Apr 13 '14 at 03:42 by Brian Bi

When a string is encoded in UTF-8, NUL encodes to a zero byte. [RFC 3629](https://tools.ietf.org/html/rfc3629) says `UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4` and `UTF8-1 = %x00-7F`. You might be thinking of UTF-8-like [variants](http://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8), but they are explicitly non-conforming around NUL. – Mike Samuel Apr 13 '14 at 03:59
Yeah, sorry, I meant that for all code points except for NUL itself. – Brian Bi Apr 13 '14 at 04:02

score 2 · Accepted Answer · answered Apr 13 '14 at 03:38


On modern Linux/Unix systems, Unicode filenames are generally assumed to be encoded in UTF-8 (typically through a UTF-8 locale), which is byte-oriented, although some underlying filesystems store filenames internally in UTF-16.
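
As a rough illustration of the byte-oriented behaviour, the sketch below passes a UTF-8 encoded name straight to `open`, `access` and `unlink` through a plain `char` buffer. The function name `roundtrip_utf8_filename` and the `/tmp/utf8demo-...` path are assumptions made for this example, and it presumes a writable /tmp on a filesystem that accepts such names.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Creates, looks up and deletes a file whose name contains "中文"
   (given as UTF-8 bytes). Returns 0 on success, -1 on any failure.
   The /tmp location is an assumption for this sketch. */
int roundtrip_utf8_filename(void) {
    char path[128];
    int n = snprintf(path, sizeof path, "/tmp/utf8demo-%ld-%s",
                     (long)getpid(), "\xe4\xb8\xad\xe6\x96\x87.txt");
    if (n < 0 || (size_t)n >= sizeof path) {
        return -1;
    }

    /* The path is handed over as an ordinary NUL-terminated byte string. */
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) {
        return -1;
    }
    close(fd);

    /* Look the file up again using the same bytes, then clean up. */
    int found = (access(path, F_OK) == 0);
    unlink(path);
    return found ? 0 : -1;
}
```

No wide-character or special Unicode call is involved here; the same `const char*` route is what dlopen takes for its path argument.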

answered Apr 13 '14 at 03:38 by user3159253
