
Since there was a lot of misinformation spread by several posters in the comments on this question: C++ ABI issues list

I have created this one to clarify.

  1. What are the encodings used for C-style strings?
  2. Is Linux using UTF-8 to encode strings?
  3. How does external encoding relate to the encoding used by narrow and wide strings?
Šimon Tóth

2 Answers

  1. Implementation defined. Or even application defined; the standard doesn't really put any restrictions on what an application does with them, and expects a lot of the behavior to depend on the locale. All that is really implementation defined is the encoding used in string literals.

  2. In what sense? Most of the OS ignores most of the encodings; you'll have problems if '\0' isn't a nul byte, but even EBCDIC meets that requirement. Otherwise, depending on the context, there will be a few additional characters which may be significant (a '/' in path names, for example); all of these use the first 128 code points of Unicode, so will have a single-byte encoding in UTF-8. As an example, I've used both UTF-8 and ISO 8859-1 for filenames under Linux. The only real issue is displaying them (see the sketch after this list): if you do `ls` in an `xterm`, for example, `ls` and the `xterm` will assume that the filenames are in the same encoding as the display font.

  3. That mainly depends on the locale. Depending on the locale, it's quite possible for the internal encoding of a narrow character string not to correspond to that used for string literals. (But how could it be otherwise, since the encoding of a string literal must be determined at compile time, whereas the internal encoding of a narrow character string depends on the locale used to read it, and can vary from one string to the next.)
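
To make point 2 concrete, here is a minimal sketch (my own illustration, with hypothetical filenames; run it in a scratch directory) that creates the same accented name under two encodings. Both calls succeed because the kernel stores filename bytes without interpreting them:

```c++
#include <cstdio>

int main()
{
    // "é" as one ISO 8859-1 byte, then as a two-byte UTF-8 sequence.
    // Both names are accepted: open(2) never interprets the bytes.
    if (std::FILE* f = std::fopen("caf\xE9.txt", "w")) std::fclose(f);      // ISO 8859-1 name
    if (std::FILE* f = std::fopen("caf\xC3\xA9.txt", "w")) std::fclose(f);  // UTF-8 name
    // `ls` in an xterm will render at most one of the two files
    // correctly, depending on which encoding the terminal assumes.
}
```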

If you're developing a new application in Linux, I would strongly recommend using Unicode for everything, with UTF-32 for wide character strings, and UTF-8 for narrow character strings. But don't count on anything outside the first 128 code points working in string literals.
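
A minimal sketch of that recommendation, assuming a glibc-style Linux system where `wchar_t` is 32 bits and an `en_US.UTF-8` locale is installed (neither is guaranteed by the standard):

```c++
#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    // Assumption: this locale is installed; setlocale returns NULL otherwise.
    if (!std::setlocale(LC_ALL, "en_US.UTF-8"))
        return 1;

    // "naïve" written as explicit UTF-8 bytes, so the literal does not
    // depend on the compiler's execution character set.
    const char utf8[] = "na\xC3\xAFve";

    // mbstowcs decodes through the locale; on Linux each wchar_t then
    // holds one UTF-32 code point.
    wchar_t wide[16];
    std::size_t points = std::mbstowcs(wide, utf8, 16);
    if (points == static_cast<std::size_t>(-1))
        return 1;   // invalid sequence in this locale

    std::printf("%zu bytes, %zu code points\n",
                std::strlen(utf8), points);   // prints "6 bytes, 5 code points"
}
```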

James Kanze
  1. This depends on the architecture. Most Unix architectures use UTF-32 for wide strings (wchar_t) and ASCII for narrow strings (char). Note that ASCII is just a 7-bit encoding. Windows used UCS-2 until Windows 2000; later versions use the variable-width encoding UTF-16 (for wchar_t).
  2. No. Most system calls on Linux are encoding agnostic (they don't care what the encoding is, since they don't interpret it in any way). The external encoding is actually defined by your current locale.
  3. The internal encoding used by narrow and wide strings is fixed; it does not change with the locale. By changing the locale you are changing the translation functions that encode and decode the data entering and leaving your program (assuming you stick with the standard C text functions). See the sketch after this list.
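
A minimal sketch of the "translation at the edge" model described in point 3, assuming a Linux system with 32-bit `wchar_t` and an installed `en_US.UTF-8` locale (both assumptions of mine, not claims of the answer):

```c++
#include <clocale>
#include <cwchar>

int main()
{
    // Assumption: an en_US.UTF-8 locale is installed.
    if (!std::setlocale(LC_ALL, "en_US.UTF-8"))
        return 1;

    // Inside the program: fixed-width code points (UTF-32 in wchar_t
    // on Linux); U+00E9 is "é".
    const wchar_t* wide = L"caf\x00E9";

    // At the edge: wprintf encodes each wchar_t into the external
    // encoding chosen by the locale (UTF-8 here) as it writes.
    std::wprintf(L"%ls\n", wide);
}
```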
Šimon Tóth
  • Being "encoding agnostic" doesn't mean that UTF-8 isn't used. In fact, it's widely used. And #2 and #3 are incompatible with each other (#2 says there is no translation of encoding, #3 says internal encoding is fixed and translation occurs at the edge) – Ben Voigt Sep 21 '11 at 13:58
    @BenVoigt The translation is done inside the C standard library. And I'm done, since you are obviously trolling. – Šimon Tóth Sep 21 '11 at 14:00
  • @Let_Me_Be The internal encoding most definitely does depend on the locale, at least for narrow characters, and changing locales will cause functions like `isupper` to return different values. And how the system interprets external coding depends on context; when displaying it in an `xterm`, for example, it uses the font encoding, regardless of the locale (but the order in which `ls` sorts the file names depends on the locale). – James Kanze Sep 21 '11 at 14:21
  • System calls are not encoding agnostic. The Win32 wide API wants UTF-16. Linux doesn't even have wide versions of its system API (and neither does any POSIX function); it "just works" for UTF-8 encoded char arrays, if the system locale is UTF-8 (which it is on all modern distros). – rubenvb Sep 21 '11 at 14:23
  • @rubenvb On Linux, the system calls just work for the encoding that is the current locale. That is the definition of encoding agnostic. – Šimon Tóth Sep 21 '11 at 14:25
  • @Let_Me_Be: not really, then it works for only one locale. It expects only one locale. Agnostic would work for all locales, as it would not know anything about a locale. Or that is how "agnostic" is defined in my dictionary... – rubenvb Sep 21 '11 at 15:00
  • @rubenvb OK, you are correct in that. But they are actually encoding agnostic. Matching pairs work correctly simply because the tools displaying the encoded string decode it correctly (because the encodings match). System calls just take what you give them and write it down (that is, system calls, not the standard C functions we are talking about now). – Šimon Tóth Sep 21 '11 at 15:09
  • Hey, you can't ask a question and then instantly answer it. That's cheating. – Puppy Sep 21 '11 at 19:32