0

I have seen many APIs on POSIX systems (for example Linux, Mac, and Android) that accept a const char* as the file path argument.

One example is dlopen: as the documentation shows, its first argument is a const char*. Does it support Unicode file paths, for example a path containing Chinese characters?

Brian Bi
ZijingWu
  • 2
    FYI: There are multiple Unicode encodings: UTF-8, which is favored for interchange and for everything on Unix systems; UTF-16, which Windows uses due to backwards-compatibility constraints; and UTF-32, which can represent every Unicode code point as a single unit of constant length. All others are only interesting in niche applications. – Deduplicator Apr 13 '14 at 03:47
  • @Deduplicator, I want to write a cross-platform Unicode API wrapper. I don't want to consider performance and memory usage, because the difference will be small. So for productivity and ease of use, any thoughts on the pros and cons of UTF-8 versus UTF-32? – ZijingWu Apr 13 '14 at 03:57
  • 1
    Go all UTF-8; that way has the fewest gotchas. Even on systems that have a UTF-16 API, everything else is done in UTF-8, so you are forced to convert there anyway. Just be sure you always use the native Unicode API, perhaps with a specialised widen/narrow converter. http://utf8everywhere.org – Deduplicator Apr 13 '14 at 14:52

2 Answers

3

POSIX is not required to support Unicode filenames. (See: https://stackoverflow.com/a/2306003/481267) However, provided that they are encoded in UTF-8, there are no technical obstacles to supporting Unicode. Many modern file systems allow any character in a file name except \0 and /.

The POSIX API deals with null-terminated byte sequences, and when a string is encoded in UTF-8, no code point other than NUL (U+0000) itself has a representation that contains a zero byte. Furthermore, all characters outside the ASCII range (0x00-0x7f) are encoded entirely using bytes with the high-order bit set (0x80-0xff), so there is no chance that the system will be confused into thinking that there is a directory separator in the middle of some Unicode character.
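
To make the byte-level argument concrete, here is a minimal sketch in C. The helper `all_bytes_high` and the array `zhongwen` are hypothetical names made up for this illustration. The array holds the UTF-8 encoding of U+4E2D U+6587 (the word for "Chinese"), spelled out as explicit bytes so the example does not depend on the encoding of the source file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if every byte in buf[0..len) has its
 * high-order bit set, meaning none of them can be mistaken for an ASCII
 * byte such as '/' (0x2f) or the terminator '\0' (0x00). */
static bool all_bytes_high(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] < 0x80)
            return false;
    }
    return true;
}

/* UTF-8 encoding of U+4E2D U+6587: three bytes per character. */
static const unsigned char zhongwen[] = {
    0xE4, 0xB8, 0xAD,   /* U+4E2D */
    0xE6, 0x96, 0x87    /* U+6587 */
};
```

Calling `all_bytes_high(zhongwen, sizeof zhongwen)` returns true, while a plain ASCII path such as `"a/b"` returns false because all of its bytes are below 0x80.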

Community
Brian Bi
  • When a string is encoded in UTF-8, NUL encodes to a zero byte. [RFC 3629](https://tools.ietf.org/html/rfc3629) says `UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4` and `UTF8-1 = %x00-7F`. You might be thinking of UTF-8-like [variants](http://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8), but they are explicitly non-conforming around NUL. – Mike Samuel Apr 13 '14 at 03:59
  • Yeah, sorry, I meant that for all code points except for NUL itself. – Brian Bi Apr 13 '14 at 04:02
2

On modern Linux/Unix systems, Unicode filenames are assumed to be encoded in UTF-8, which is byte-oriented (although some underlying filesystems store filenames internally in UTF-16).
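
As a rough illustration, the sketch below passes a UTF-8 encoded path straight to the ordinary `const char*` POSIX calls `open`, `stat`, and `unlink`. The helper `roundtrip_path` and the constant `utf8_path` are hypothetical names chosen for this example. It assumes that `/tmp` exists and is writable and that the underlying filesystem accepts the name; neither is guaranteed by POSIX.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: creates an empty file at path, checks that the same
 * byte string finds it again, then removes it.
 * Returns 0 on success and -1 on any failure. */
static int roundtrip_path(const char *path)
{
    assert(path != NULL);

    /* O_EXCL makes the call fail if the file already exists. */
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    struct stat st;
    int found = stat(path, &st);
    int removed = unlink(path);
    return (found == 0 && removed == 0) ? 0 : -1;
}

/* Assumed location: "/tmp/" followed by U+4E2D U+6587 in UTF-8 and ".txt".
 * The bytes are written as escapes so the source file encoding is irrelevant. */
static const char utf8_path[] = "/tmp/\xe4\xb8\xad\xe6\x96\x87.txt";
```

On a typical Linux filesystem, `roundtrip_path(utf8_path)` should return 0, because the kernel treats the name as an opaque byte string that contains neither a zero byte nor an unintended `/`.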

user3159253