WChars, Encodings, Standards and Portability

Question

The following may not qualify as a SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly and is this the right way to go about things?"

I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:

Portability and serialization are orthogonal concepts.

Portable things are things like C, unsigned int, wchar_t. Serializable things are things like uint32_t or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things on the other hand always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning.

When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:

wchar_t, setlocale(), mbsrtowcs()/wcsrtombs(): The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is main(int, char**); you get a type wchar_t which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa."
iconv() and UTF-8,16,32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception.

The bridge between the portable, encoding-agnostic world of C with its wchar_t portable character type and the deterministic outside world is iconv conversion between WCHAR_T and UTF.

So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via wcsrtombs(), and use iconv() for serialization? Conceptually:

                        my program
    <-- wcstombs ---  /==============\   --- iconv(UTF8, WCHAR_T) -->
CRT                   |   wchar_t[]  |                                <Disk>
    --- mbstowcs -->  \==============/   <-- iconv(WCHAR_T, UTF8) ---
                            |
                            +-- iconv(WCHAR_T, UCS-4) --+
                                                        |
       ... <--- (adv. Unicode malarkey) ----- libicu ---+

Practically, that means that I'd write two boiler-plate wrappers for my program entry point, e.g. for C++:

// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc

int wmain(const std::vector<std::wstring> args); // user starts here

#if defined(_WIN32) || defined(WIN32)
#include <windows.h>
extern "C" int main()
{
  setlocale(LC_CTYPE, "");
  int argc;
  wchar_t * const * const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
  return wmain(std::vector<std::wstring>(argv, argv + argc));
}
#else
extern "C" int main(int argc, char * argv[])
{
  setlocale(LC_CTYPE, "");
  return wmain(parse(argc, argv));
}
#endif

// Serialization utilities

#include <cstdint>   // uint16_t, uint32_t
#include <iconv.h>
#include <string>

typedef std::basic_string<uint16_t> U16String;
typedef std::basic_string<uint32_t> U32String;

U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);

/* ... */
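
To make the serialization side concrete, here is a minimal sketch of one such utility, in the same spirit as the toUTF16/toUTF32 declarations above. The name wideToUTF8 is hypothetical and not part of the question. It assumes that the platform's iconv accepts the encoding name "WCHAR_T" and that iconv() takes a char** input pointer (as glibc does; some platforms declare it const char**).

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: serialize a wide string as UTF-8 via iconv.
std::string wideToUTF8(const std::wstring& s)
{
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");   // (tocode, fromcode)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // iconv wants a non-const input pointer, so work on a copy.
    std::vector<wchar_t> in(s.begin(), s.end());
    char* inPtr = reinterpret_cast<char*>(in.data());
    std::size_t inLeft = in.size() * sizeof(wchar_t);

    std::string out;
    char buf[256];
    while (inLeft > 0) {
        char* outPtr = buf;
        std::size_t outLeft = sizeof buf;
        const std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
        const int err = (rc == (std::size_t)-1) ? errno : 0;
        assert(outLeft <= sizeof buf);             // guards the subtraction below
        out.append(buf, sizeof buf - outLeft);
        if (err != 0 && err != E2BIG) {            // E2BIG only means "buffer full"
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or unrepresentable input");
        }
    }
    iconv_close(cd);
    return out;
}
```

The conversion fails loudly for characters that cannot be represented, which matches the "either it works or it fails" behaviour the question relies on.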

Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)
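
The parse() helper declared in the wrapper is left open in the question. A minimal sketch of one way to fill it in with mbsrtowcs follows; the helper name widen is hypothetical, and the code assumes setlocale(LC_CTYPE, "") has already been called so that the multibyte input is interpreted according to the user's locale.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: widen one multibyte string using the current LC_CTYPE locale.
std::wstring widen(const char* mbs)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = mbs;

    // First pass: with a null destination, mbsrtowcs only counts.
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for this locale");

    // Second pass: convert, including the terminating null.
    std::vector<wchar_t> buf(len + 1);
    state = std::mbstate_t();
    src = mbs;
    std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
    assert(src == nullptr);   // set to null once the terminator has been converted
    return std::wstring(buf.data(), len);
}

// Widen every command line argument.
std::vector<std::wstring> parse(int argc, char* argv[])
{
    std::vector<std::wstring> args;
    for (int i = 0; i < argc; ++i)
        args.push_back(widen(argv[i]));
    return args;
}
```

Note that nothing here names an encoding: the only input is the locale, which is the point of the "portable" half of the design.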

Updates

Following many very nice comments I'd like to add a few observations:

If your application explicitly wants to deal with Unicode text, you should make the iconv-conversion part of the core and use uint32_t/char32_t-strings internally with UCS-4.
Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding and mbstowcs is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer-drop together with GetCommandLineW+CommandLineToArgvW works (perhaps there should be a separate wrapper for Windows).
File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. char16_t sequences that do not constitute valid UTF16 (e.g. naked surrogates) are valid NTFS filenames). The Standard C fopen is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific _wfopen may be required. As a corollary, there is in general no well defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.

Looks good to me... I might `assert()` that setlocale did not return NULL. (The spec says it returns a string on success and NULL otherwise, but then does not define any actual errors. To me that says to assert that it did not return NULL.) Great question, by the way. — Nemo, Jun 10 '11 at 00:52
Although I do not think `wmain` should be `extern "C"` if it takes a `std::vector`. (I do not think you are supposed to pass a C++ class to a function with C linkage.) — Nemo, Jun 10 '11 at 00:55
Yes, the full code has plenty of checks of return values of `setlocale` and the conversions and `iconv_open()` -- this is more of a conceptual question. I had thought `wchar_t` was a useless monster for the longest time, but suddenly I feel that it's actually a really good idea... — Kerrek SB, Jun 10 '11 at 00:56
This question is actually an answer to many long-standing doubts that I had about the C/C++ standards and how they expect us to use `wchar_t`s & co. +1, and I would give more if I could. :) — Matteo Italia, Jun 10 '11 at 01:07
"you get a type wchar_t which can hold all your system's characters" -- No, it's worse than that. In Windows, wchar_t might only hold half of a surrogate pair. For those characters you need two wchar_t objects to contain an entire character. It could be worse. If I recall correctly, an obnoxious but legal implementation could make wchar_t the same as unsigned char. — Windows programmer, Jun 10 '11 at 06:19
@WP: A surrogate isn't a character. It's part of a serialization method. Granted, with Windows's 16-bit wchars, Windows has access to a smaller range of characters than Linux. But that's just a characteristic of the platform which I'm happy to live with. If you follow my flow-chart, you'll see that I would never face any "surrogates" -- the relevant conversions would simply fail for unrepresentable characters. That's OK. — Kerrek SB, Jun 10 '11 at 07:30
Yes a surrogate isn't a character, and that's exactly why you DON'T get a type wchar_t which can hold all of your system's characters. — Windows programmer, Jun 10 '11 at 07:39
@WP: I think we're talking past each other. Me, I'm saying that internally there are only characters, nothing else. Only when you serialize the characters into a well-defined output stream, such as UTF16, you start getting things like surrogates and byteorder marks and whatnot. If my wchar_t is 16bit, then I simply cannot hold more than 2^16 distinct characters, but note that there is no mention at all about how wchar_t values correspond to characters. mbstowcs gives me "the right thing", but I have no right to suppose anything about the internal representation of characters. — Kerrek SB, Jun 10 '11 at 11:48
Using UTF-16 for `wchar_t` is broken, and cannot be made to work right. The next version of the C standard has new types `char16_t` and `char32_t` to accommodate systems that insist on using UTF-16 internally. — ninjalj, Jun 10 '11 at 18:46
If `__STDC_ISO_10646__` is defined, `wchar_t` values are Unicode codepoints. C1x has `__STDC_UTF_16__` and `__STDC_UTF_32__` for `char16_t` and `char32_t`, respectively, C++0x doesn't seem to have these last two macros. — ninjalj, Jun 10 '11 at 18:58
@Ninjalj: Thanks, that's good to know. It could spare you an iconv-conversion from WCHAR_T to UTF32 if you know that you already have raw codepoints. — Kerrek SB, Jun 11 '11 at 01:35
This question definitely pertains to only C++, as only C++ sample code is posted and I see no way that it's for C. — Puppy, Jun 11 '11 at 11:26
@DeadMG: The question is general and about the flow of data, I just gave an example in C++, but I could equally have done one for C using `wchar_t[]` etc. Could you not have asked me before editing my question? — Kerrek SB, Jun 11 '11 at 23:54
@Dietrich: Typical scenario: `uint32_t in; read_from_file((char*)(&in), 4);`. Sure, you could read into a `char[4]` and just use arithmetic, but type punning is often convenient and morally fitting because the i/o byte stream simply doesn't have a type system, so manual coercion is inevitable. Type-ignorant byte-stream serialization often goes well with explicit type casting. — Kerrek SB, Jun 11 '11 at 23:57
@Kerrek: You can do the same thing with an `int`. Neither will create files that can be transferred between different platforms. — Dietrich Epp, Jun 12 '11 at 01:48
@Dietrich: You mean because of endianness? I suppose you should do something like `uint32_t myint = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24);` to read from a byte stream with definite endianness. That way you don't need to cast pointers. I guess what I should have said is that serialization requires manual "typing". — Kerrek SB, Jun 12 '11 at 09:13
@Kerrek: that is again nonportable because `CHAR_BIT` is not guaranteed to be 8—i.e., a byte might be larger than 8 bits. — Philipp, Jun 14 '11 at 05:26
@Philipp: I thought you might say that :-) But a varying bit number seems to put a limit to serialization via `read()`/`write()` anyway, i.e. if I cannot predict how much data `read(1)` will read, then I can't really exchange data between such platforms anyway. So I'm willing to put the stop there. (But perhaps you'll agree that pointer-casting would be a portable way to write code that can serialize among platforms of equal, yet undetermined, bit number?) — Kerrek SB, Jun 14 '11 at 11:32
Only one word to say: read http://www.utf8everywhere.org about how, why, how cold, why it happened, what to do now and what others should. — Pavel Radzivilovsky, Sep 13 '12 at 21:29
@KerrekSB "File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name" This isn't quite true. Most Unix-like filesystems do this. But HFS+ on Mac OS X stores filenames as UTF-16 Normalization Form D (though the standard `char *` APIs accept UTF-8), and filenames are compared case-insensitively. NTFS filenames are 16 bit, I don't believe they do any normalization, but they also compare case-insensitively when interpreted as UTF-16. I have never bothered to find out the exact case mapping algorithm they each use; I'd probably be horrified. — Brian Campbell, Oct 29 '13 at 03:48
@BrianCampbell: It is the *Windows API* that uses NTFS to store UTF-16, but that's not a property of NTFS. With the native API you can store arbitrary 16-bit sequences in filenames, even invalid UTF-16 (much to the discontent of the Windows API). — Kerrek SB, Oct 29 '13 at 08:45
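
The byte-order point from the comments above can be sketched as follows. The names readLE32 and writeLE32 are hypothetical. The code assumes 8-bit bytes, which is exactly the CHAR_BIT caveat raised in the thread, and it casts each octet to uint32_t before shifting so that no shift happens in a signed int.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Hypothetical helper: decode a 32-bit little-endian value from 4 octets.
std::uint32_t readLE32(const unsigned char* p)
{
    assert(p != nullptr);
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Hypothetical helper: encode a 32-bit value as 4 little-endian octets.
void writeLE32(std::uint32_t v, unsigned char* p)
{
    assert(p != nullptr);
    p[0] = static_cast<unsigned char>(v & 0xFF);
    p[1] = static_cast<unsigned char>((v >> 8) & 0xFF);
    p[2] = static_cast<unsigned char>((v >> 16) & 0xFF);
    p[3] = static_cast<unsigned char>((v >> 24) & 0xFF);
}
```

Unlike the pointer cast, this gives the same file contents regardless of the host's endianness.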

score 24 · Accepted Answer · answered Jun 11 '11 at 21:18

Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++

No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with

int main(int argc, char** argv)

you have already lost Unicode support for command line arguments. You have to write

int wmain(int argc, wchar_t** argv)

instead, or use the GetCommandLineW function, neither of which is specified in the C standard.

More specifically,

any Unicode-capable program on Windows must actively ignore the C and C++ standard for things like command line arguments, file and console I/O, or file and directory manipulation. This is certainly not idiomatic. Use the Microsoft extensions or wrappers like Boost.Filesystem or Qt instead.
Portability is extremely hard to achieve, especially for Unicode support. You really have to be prepared that everything you think you know is possibly wrong. For example, you have to consider that the filenames you use to open files can be different from the filenames that are actually used, and that two seemingly different filenames may represent the same file. After you create two files a and b, you might end up with a single file c, or two files d and e, whose filenames are different from the file names you passed to the OS. Either you need an external wrapper library or lots of #ifdefs.
Encoding agnosticity usually just doesn't work in practice, especially if you want to be portable. You have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (but not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know with which encoding you work, or use a wrapper library that abstracts them away.

I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort into it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.
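
As an illustration of the "lots of #ifdefs" route, here is a minimal sketch of a file-opening wrapper that calls _wfopen on Windows and fopen elsewhere. The name openFile is hypothetical. On non-Windows systems it assumes the file name must be converted to the current locale's multibyte encoding, and it simply fails if the name is not representable there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical wrapper: open a file named by a wide string.
// `mode` is a plain ASCII fopen mode such as "rb".
std::FILE* openFile(const std::wstring& path, const char* mode)
{
    assert(mode != nullptr);
#if defined(_WIN32)
    // Widen the ASCII mode string; the file name is passed through unchanged.
    const std::wstring wmode(mode, mode + std::strlen(mode));
    return _wfopen(path.c_str(), wmode.c_str());
#else
    // Convert the name to the locale's multibyte encoding.
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = path.c_str();
    const std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        return nullptr;                       // not representable in this locale
    std::vector<char> buf(len + 1);
    state = std::mbstate_t();
    src = path.c_str();
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);
    return std::fopen(buf.data(), mode);
#endif
}
```

The rest of the program then never calls fopen directly, so the platform difference stays in one place.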

answered Jun 11 '11 at 21:18

Philipp

Interesting -- is wmain() not just a wrapper around main() and mbstowcs? I mean, mbstowcs is available on Windows, are you sure that won't work with unicode input? Also, I said "portable", NOT "portable, unicode-capable". Unicode support is explicitly a separate feature, see my reply to Dietrich's answer. Yes, if you want Unicode, you have to include that into your core, no doubt. Me, I was rather after the idea that I can make a small, self-contained prog with console I/O without EVER thinking about encodings and only using the standard C functions and yet get access to lots of characters. – Kerrek SB Jun 12 '11 at 00:08
A thought about filenames: if filenames aren't ASCII, you will simply have to find out from somewhere else which encoding the stdio function `fopen()` requires. You can then convert to that encoding from your internal wide strings. But finding that out is outside the scope of the language standard, I suppose. – Kerrek SB Jun 12 '11 at 00:34
@Kerrek: No, `wmain` is not a wrapper around `main`, and `main` doesn't work with Unicode. The true entry point of a Windows console application using the Microsoft runtime is `_wmainCRTStartup`, which gets the command line via `GetCommandLineW`, parses it, and calls `wmain`. – Philipp Jun 12 '11 at 07:00
@Kerrek: Regarding filenames. Windows uses UTF-16 for filenames (and for everything else), but you can't use `fopen` to access them. You have to use `_wfopen`, which is nonstandard. If you really want a portable C or C++ program, you can't support Unicode on Windows, and I think that is hardly acceptable nowadays. So better forget about portability... – Philipp Jun 12 '11 at 07:02
@Philipp: Does that mean that `fopen` simply doesn't work on Windows for certain files? I never realized -- what does the C standard say about that? Wouldn't that mean that something is broken? – Kerrek SB Jun 12 '11 at 09:15
@Philipp: Can you have non-BMP characters in Windows filenames? – Kerrek SB Jun 12 '11 at 09:28
@Kerrek: I don't think the C standard says anything about filenames. And yes, `fopen` from the Microsoft C runtime doesn't work if you try to open any file whose name isn't representable in the current legacy encoding ("ANSI codepage"). Essentially that means that `fopen` is not usable. – Philipp Jun 12 '11 at 14:16
@Kerrek: I think non-BMP characters are possible in Windows filenames. AFAIK the kernel treats filenames as opaque arrays of 16-bit numbers, so even illegal UTF-16 strings like lone surrogates should be possible (but not advisable, of course). – Philipp Jun 12 '11 at 14:17
@Philipp: It seems that many file systems don't have an explicit notion of encoding and just treat filenames as null-terminated byte strings. That's OK. But would that mean that can't open a file in Windows using `fopen` or `_wfopen` if the filename has non-BMP characters or illegal stuff? (I.e. would you need a kernel function?) – Kerrek SB Jun 13 '11 at 00:23
@Kerrek I haven't tested it, but I'd bet you can open all files with `_wfopen` (not with `fopen`) that you can open with `CreateFileW`. AFAIK `_wfopen` is just a wrapper around `CreateFileW` which translates its arguments and passes them along, but doesn't add additional checks. – Philipp Jun 13 '11 at 06:24
Yes, you can open any file with `_wfopen`: That's what it's *for*. But it's Windows-specific. For cross-platform code, you'll need to write a function that calls `_wfopen` on Windows and `fopen` on other systems. – dan04 Jun 13 '11 at 08:46
Oh alright, I understand, for Windows you need just any null-terminated sequence of *16-bit numbers* as a filename, as opposed to a sequence of *bytes* as passed to `fopen`. There's no notion of encoding in Windows either, but to be safe you'd have to perform an internal conversion from WCHAR_T to UTF16LE before using `_wfopen` (if you keep your strings as wchar_t-strings internally as in my flowchart). – Kerrek SB Jun 13 '11 at 09:22
Hey, another question: Is the Windows `_wfopen` identical to `fopen` composed with `mbstowcs`? I mean, if the filesystem always uses 16-bit units, surely there must be some sort of translation inside `fopen`...? So if that's the case, can I just drop `_wfopen` and simply go the otherway, first `wcstombs` and then the ordinary `fopen`? – Kerrek SB Jun 13 '11 at 21:33
@Kerrek: No, it is not identical because the encodings used by `fopen` (Windows-1252 etc.) only represent a small subset of Unicode. `fopen` internally calls `CreateFileA`, which in turn translates the filename argument to UTF-16 (presumably using `MultiByteToWideChar`) and calls `CreateFileW`. `_wfopen` calls `CreateFileW` directly. There is no way to avoid calling `CreateFileW` or a wrapper function thereof, and in particular, it is not possible in any way to get Unicode support if you use `fopen` from the Microsoft C runtime. – Philipp Jun 14 '11 at 05:24
@P: Why do you say "no way" if you just said that `CreateFileA` just calls a conversion function internally? Why can't I do `setlocale(...); wcstombs(...); fopen(my_mbs);` and get the same result? Are you assuming (or do you know) that the locale (1252?) must always be a classical fixed-8bit one? (I tried this myself when I got back to my Windows machine last night, and indeed I failed totally to make even umlaut-filenames print correctly (after `myprog *.txt` etc.) in the WinXP `cmd` console; the encoding was ...-1252.) – Kerrek SB Jun 14 '11 at 11:36
I got loads of useful information out of all the discussions, but I have to choose one to accept. I hope you understand that this is fairly arbitrary and I appreciate all contributions greatly! – Kerrek SB Jun 17 '11 at 23:13
disagree with recommendation to work with wchar_t. I think char is better for unicode support. Summary of my views is in utf8everywhere.org. – Pavel Radzivilovsky Sep 13 '12 at 21:31

score 9 · Answer 2 · answered Jun 10 '11 at 01:03

I would avoid the wchar_t type because it's platform-dependent (not "serializable" by your definition): UTF-16 on Windows and UTF-32 on most Unix-like systems. Instead, use the char16_t and/or char32_t types from C++0x/C1x. (If you don't have a new compiler, typedef them as uint16_t and uint32_t for now.)

DO define functions to convert between UTF-8, UTF-16, and UTF-32.

DON'T write overloaded narrow/wide versions of every string function like the Windows API did with -A and -W. Pick one preferred encoding to use internally, and stick to it. For things that need a different encoding, convert as necessary.
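
A minimal sketch of such conversion functions, using UTF-32 as the internal encoding. The names utf32ToUtf8, utf32ToUtf16 and checkScalar are hypothetical, and rejecting invalid input with an exception is just one possible policy (substituting U+FFFD would be another).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Reject values that are not Unicode scalar values.
static void checkScalar(char32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
}

// Encode UTF-32 as UTF-8 (1 to 4 bytes per code point).
std::string utf32ToUtf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        checkScalar(cp);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Encode UTF-32 as UTF-16 (surrogate pairs above the BMP).
std::u16string utf32ToUtf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in) {
        checkScalar(cp);
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            const char32_t v = cp - 0x10000;
            assert(v <= 0xFFFFF);                               // 20 bits remain
            out += static_cast<char16_t>(0xD800 | (v >> 10));   // high surrogate
            out += static_cast<char16_t>(0xDC00 | (v & 0x3FF)); // low surrogate
        }
    }
    return out;
}
```

With one internal encoding fixed, every interface only needs a conversion at its boundary rather than a parallel narrow/wide API.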

answered Jun 10 '11 at 01:03

dan04

I think we mean different things by "platform dependent" and "portable". I don't want to swap my RAM content between a PC, a Mac and a Playstation, I just want the program to compile and run on each platform. Ideally I don't want to have to know about _any_ encoding at all! The only time I need to worry about encodings is at the serialization/deserialization stage, which is where I interface using `iconv()`. Internally, I don't want to know anything about the representation of my data. Does that make sense? Like the basic C motto, "values, not representation". – Kerrek SB Jun 10 '11 at 01:06
Also, by your reasoning `int` is platform dependent because its 32 bit here and 64 bit there -- yes, types may have different ranges on different platforms, but that doesn't make something not portable -- it just makes it behave differently. E.g. Windows XP doesn't let me use non-BMP unicode characters but Linux does. Fine. That's what you get for being native. – Kerrek SB Jun 10 '11 at 01:09
UTF-32 isn't really "native" for Linux the way UTF-16 is for Windows: All the POSIX API functions (that aren't specifically related to wide-character handling) use `char*` strings. – dan04 Jun 10 '11 at 02:10
The Windows API is a different story. Its MultiByte* functions actually tell you that they produce Unicode. Me, I'm only interested in standard-C. I believe that `<wchar.h>` does provide wide versions of all the standard functions, e.g. `wcstoul` and `wcscmp` etc. No _encoding_ is native, because the language standard doesn't talk about i/o serialisation formats. – Kerrek SB Jun 10 '11 at 11:55

score 9 · Answer 3 · answered Jun 11 '11 at 11:35

The problem with wchar_t is that encoding-agnostic text processing is too difficult and should be avoided. If you stick with "pure C" as you say, you can use all of the w* functions like wcscat and friends, but if you want to do anything more sophisticated then you have to dive into the abyss.

Here are some things that are much harder with wchar_t than they are if you just pick one of the UTF encodings:

Parsing Javascript: Identifiers can contain certain characters outside the BMP (and let's assume that you care about this kind of correctness).
HTML: How do you turn &#65536; into a string of wchar_t?
Text editor: How do you find grapheme cluster boundaries in a wchar_t string?

If I know the encoding of a string, I can examine the characters directly. If I don't know the encoding, I have to hope that whatever I want to do with a string is implemented by a library function somewhere. So the portability of wchar_t is somewhat irrelevant as I don't consider it an especially useful data type.

Your program requirements may differ and wchar_t may work fine for you.
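
To illustrate what "examine the characters directly" buys you when the encoding is known, here is a minimal sketch that counts code points in a UTF-8 string by looking at the bytes. The name countCodePoints is hypothetical, the input is assumed to be well-formed UTF-8, and this counts code points only, not grapheme clusters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a string that is assumed to be well-formed UTF-8.
// Every code point starts with a byte that is not of the form 10xxxxxx.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)   // skip continuation bytes
            ++n;
    }
    assert(n <= utf8.size());        // never more code points than bytes
    return n;
}
```

No equivalent can be written against an opaque wchar_t string without first assuming something about what the values mean.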

answered Jun 11 '11 at 11:35

Dietrich Epp

Good point, I think you really hit the issue here that it all depends on what you want to do with the data. If explicitly-unicode text processing is a core part, then by all means the transformation to, say, UTF32 as the primary internal program should be part of the core, not the I/O (i.e. the input is mbsrtowcs -> iconv(WCHAR_T -> UTF32); output is the reverse). Just adapt my ASCII art chart above accordingly... – Kerrek SB Jun 12 '11 at 00:02
... On the other hand, if text strings play a purely ancillary role in your program (e.g. player names printed on the final score screen), then restricting ourselves to the available system characters is perfectly reasonable. About HTML: You'll have to know the page's encoding! If it's, say, UTF32, then just do iconv(UTF32->WCHAR_T) on U"\65536"; either it works or it fails. Your Text and JS examples clearly mandate explicit handling of Unicode, so see above. (The text example will probably even require sophisticated unicode stuff, e.g. see libicu.) – Kerrek SB Jun 12 '11 at 00:06
Also, I agree that the utility of an abstract "string" type without knowing its encoding may be fairly limited. But what I could definitely do is comparing and matching, even with literal constants a la `L"foo"`, so I think that there could also be plenty of situations where I need _some_ sort of string handling, but I never need to know particulars about the encoding -- e.g. read stuff from stdin, assign seat numbers to each and output the result to stdout. – Kerrek SB Jun 12 '11 at 00:22
@Kerrek: While true that you don't always need to know which encoding you're using, it can be difficult to predict whether that applies to your project. Choosing a specific encoding (UTF-8/16/32) is relatively safe, and except for a few platform-specific APIs, I don't see any benefit to `wchar_t`. It's worse if you consider that a portable program (according to the spec) is not allowed to assume that `wchar_t` can store an arbitrary Unicode string, even after conversion. – Dietrich Epp Jun 12 '11 at 01:46
I suppose practically that makes sense. I guess there's a theoretical possibility that your environment uses an entirely obscure encoding that you don't know and can't make, so that you need to use `wcstombs` to create usable output, and you need to go via an internal `wchar_t`-string. But realistically, when the locale uses UTF8, then an internal 16-bit `wchar_t` representation does indeed limit you unnecessarily. I think my real question is then how I should treat the stdin data if not via `mbstowcs`. – Kerrek SB Jun 12 '11 at 09:22

score 6 · Answer 4 · answered Jun 10 '11 at 01:37

Given that iconv is not "pure standard C/C++", I don't think you are satisfying your own specifications.

There are new codecvt facets coming with char32_t and char16_t so I don't see how you can be wrong as long as you are consistent and pick one char type + encoding if the facets are here.

The facets are described in 22.5 [locale.stdcvt] (from n3242).

I don't understand how this doesn't satisfy at least some of your requirements:

#include <codecvt>
#include <locale>
#include <string>

namespace ns {

typedef char32_t char_t;
typedef std::u32string string;

// or use a user-defined literal
#define LIT(x) U ## x

// Communicate with interface0, which wants UTF-8

// This type doesn't need to be public at all; I just refactored it.
typedef std::wstring_convert<std::codecvt_utf8<char_t>, char_t> converter0;

inline std::string
to_interface0(string const& s)
{
    return converter0().to_bytes(s);
}

inline string
from_interface0(std::string const& s)
{
    return converter0().from_bytes(s);
}

// Communicate with interface1, which wants UTF-16

// Doesn't have to be public either
typedef std::wstring_convert<std::codecvt_utf16<char_t>, char_t> converter1;

// wstring_convert always produces a byte string, so the UTF-16 data
// comes back as a std::string holding the serialized code units.
inline std::string
to_interface1(string const& s)
{
    return converter1().to_bytes(s);
}

inline string
from_interface1(std::string const& s)
{
    return converter1().from_bytes(s);
}

} // ns

Then your code can use ns::string, ns::char_t, LIT('A') & LIT("Hello, World!") with reckless abandon, without knowing what the underlying representation is. Then use from_interfaceX(some_string) whenever it's needed. It doesn't affect the global locale or streams either. The helpers can be as clever as needed, e.g. codecvt_utf8 can deal with 'headers', which I assume is Standardese for tricky stuff like the BOM (ditto codecvt_utf16).

In fact I wrote the above to be as short as possible but you'd really want helpers like this:

template<typename... T>
inline ns::string
ns::from_interface0(T&&... t)
{
    return converter0().from_bytes(std::forward<T>(t)...);
}

which give you access to the overloads of each of the [from|to]_bytes members, accepting things like const char* or ranges.

edited Jun 11 '11 at 10:19

answered Jun 10 '11 at 01:37

Luc Danton

iconv can't be "pure standard", because the pure standard has no notion of encoding at all. That's why I only want to use iconv at the i/o interface end. Ideally I don't want to "pick one encoding" internally, because encodings aren't programming concepts -- they're serialization concepts. While I'm not serializing, I would feel dirty if I had to mention an explicit encoding. – Kerrek SB Jun 10 '11 at 07:27
What do you mean, mention? You can refactor that away in e.g. a typedef (but you still will have to settle for a given literal, unless using macros). The correct overloads are picked for whatever conversions are needed when interfacing with something. And if you feel that "encoding aren't programming concepts" then why not pick UTF-32? – Luc Danton Jun 10 '11 at 07:33
By "mention" I mean that if I write `'a'` or `L'a'`, I get "the character 'a'", but I have absolutely no right to suppose anything about how that's implemented (in particular that it's integrally 97). _All_ I am guaranteed is that char can hold an `'a'` and wchar_t a `L'a'`. No typedefs, no choices, no encodings. Just the character 'a'. – Kerrek SB Jun 10 '11 at 11:50
That's interesting, I hadn't really given C++ locale support a thought. So what is my program-internal string type, and how do I read the command line arguments, say? E.g. what's the equivalent of `setlocale(LC_CTYPE, ""); mbsrtowcs(buf, &argv[i], N, 0);`, which creates an internal, opaque, wide string (without me needing to think about encodings)? Do those "facets" do the same job as iconv? – Kerrek SB Jun 11 '11 at 09:07
@Kerrek Should I assume that the input to the program is in utf-8? Otherwise a `std::copy` would be correct even in your own example. – Luc Danton Jun 11 '11 at 09:41
@Kerrek I am somewhat in disbelief. From the draft Standard I can see how to go from the narrow set to the wide set, and from there on to any Unicode encoding. Are you still interested in that? – Luc Danton Jun 11 '11 at 10:14
@Luc: I don't want to have to make any assumptions or know anything about the environment. That's why I was hoping that I can just use `setlocale(LC_CTYPE, "");` and `mbsrtowcs()` (possibly followed by iconv if I feel like I need a specific encoding internally). If I'm reading from a file, I'd have to know the encoding of course and I'd use iconv(file-enc -> WCHAR_T) rather than mbsrtowcs. [That was for your -3rd comment, I didn't nevermind it after all ;-). Please do say if you know how to do things in a C++ way with facets!] – Kerrek SB Jun 12 '11 at 00:16
@Kerrek After having slept on the problem I'm in the process of rewriting this answer. – Luc Danton Jun 12 '11 at 00:19
Oh, just thinking, the platform should probably provide a method to bridge between program console I/O and file encodings to deal with redirection. If you store the standard output in a file, you have to be able to learn the locale's encoding, since you'll have received whatever wcstombs created. That's the platform's responsibility, though, not the programmer's. – Kerrek SB Jun 12 '11 at 00:25
@Kerrek After a bit of looking around, while it is possible to convert from (char, narrow encoding) to (wchar_t, wide encoding), and it is possible to convert from any ([char, char16_t, char32_t], [utf-8, utf-16, utf-32]) pair to almost any other, the Standard doesn't provide a way to go from the implementation encodings to Unicode ones and back. I won't salvage this answer and I recommend Philipp's. – Luc Danton Jun 12 '11 at 04:42
@Luc: Alright, cheers, and thanks for your input! I see that an alternative design could be to avoid the `setlocale`/`mbstowcs` operations altogether and instead find a different means of discovering the locale's encoding (how?). With that, one could use `iconv` directly to move from the stdin to a deterministic internal encoding (UTF32). Perhaps that's more practical. But I'm also concerned about the portability of `fopen` now as Philipp brought it up... – Kerrek SB Jun 12 '11 at 09:26
@Kerrek Now I'm wondering if `std::use_facet<std::codecvt<char_type, char, std::mbstate_t>>(loc)` (where char_type is `char16_t` or `char32_t`) is a legitimate way of finding out if your implementation is using (char, utf-8)/(char_type, utf-16/utf-32) for their narrow/wide sets. Seems hackish. – Luc Danton Jun 12 '11 at 09:42
@Luc: I'll have to read up some background on locales and facets in C++ -- I never looked at those seriously before you brought them up. – Kerrek SB Jun 12 '11 at 09:57
I looked up locales now, and specifically the `codecvt` facet. Unfortunately, that one seems rather useless: the generic version has no features, and the specialization to "internal wchar_t, external char" provides basically only a verbose, cumbersome wrapper around `mbstowcs`/`wcstombs`. So I don't imagine it would clear up anything to make that part of the design. – Kerrek SB Jun 13 '11 at 21:10
@Kerrek I was going by the n3242 draft of C++0x, which defines new specializations for `codecvt` *and* adds new `codecvt_utfX` types. And more to the point, the very convenient `wstring_convert` which wraps the ugliness of `codecvt`. – Luc Danton Jun 13 '11 at 21:16
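To illustrate what `wstring_convert` looks like in use, here is a minimal sketch assuming a standard library that ships `<codecvt>` (these facilities were later deprecated in C++17, so a compiler may warn). The function names `utf8_to_utf32` and `utf32_to_utf8` are hypothetical, chosen only for this example.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into UTF-32 code points.
// from_bytes throws std::range_error on malformed input.
std::u32string utf8_to_utf32(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: encodes UTF-32 code points as a UTF-8 byte string.
std::string utf32_to_utf8(const std::u32string& utf32) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}
```

Note that this only converts between Unicode encodings. As the comments here point out, it does not bridge to the implementation's narrow or wide execution encodings.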
@Luc: Oh I see -- let me look at the new standard again then (and try whether GCC has support for it)! – Kerrek SB Jun 13 '11 at 21:34
@Kerrek My month-old snapshot doesn't :(. The `codecvt_utfX` are supposed to reside inside `<codecvt>` but there's no such header. There's no `wstring_convert` either, nor the required `codecvt` specializations for `char16_t` and `char32_t`. – Luc Danton Jun 13 '11 at 21:42
Right, I see. But if it did exist, then conversion with `wstring_convert` would nicely encapsulate my hand-written conversion functions (not included in my code snippets above). And I guess one could build a locale with that facet and imbue cin/cout with that and it would automagically do the right thing when using standard in/out? – Kerrek SB Jun 13 '11 at 22:01
@Kerrek It's complicated. First, you'd want to imbue the wide streams, not the narrow ones. Then, AFAIK `std::wcout` 'speaks' (well, wants) wide set encoding. But there's no `codecvt` that does Unicode (any kind) <-> wide (or even narrow) set encoding. They all convert to/from multibyte Unicode! – Luc Danton Jun 13 '11 at 23:13
I'm wrong about which particular streams to imbue: in both cases there's no overload that will accept a character type other than the one it deals with. So you can't pass `const char32_t*` to anything other than `std::basic_ostream<char32_t>`! This does mean however that if there were a `codecvt` for UTF-8 <-> narrow set encoding then you could, in fact, imbue e.g. `std::cout` and then pass a `u8""` literal with portable results. Probably the hottest thing to wait for from a possible (?) Boost.Unicode! – Luc Danton Jun 13 '11 at 23:22
Folks - you know we have an excellent chat feature where you can carry on this fascinating discussion. :) – Kev Jun 13 '11 at 23:45
You know, I finally downloaded a copy of libc++ and made `wstring_convert` work and thought I should update this question, and it turns out you've already said everything I wanted to say two years ago :-S – Kerrek SB Jan 01 '14 at 17:36