19

C11 and C++11 both introduce the `<uchar.h>`/`<cuchar>` header, which defines `char16_t` and `char32_t` as character types at least 16 and 32 bits wide, adds the literal syntax `u""` and `U""` for writing strings of these character types, and provides the macros `__STDC_UTF_16__` and `__STDC_UTF_32__` that tell you whether those types hold UTF-16 and UTF-32 code units. This helps remove the ambiguity around `wchar_t`, which on some platforms is 16 bits wide and generally used to hold UTF-16 code units, and on other platforms is 32 bits wide and generally used to hold UTF-32 code units; assuming those macros are set, you can now write portable, unambiguous code referring to UTF-16 and UTF-32. `__STDC_ISO_10646__` can also be used as a proxy to determine whether `wchar_t` is capable of holding UTF-32 values; if it can't, you can't necessarily assume that it holds UTF-16, but that's probably a close enough approximation to be portable.

They also add the functions mbrtoc16, mbrtoc32, c16rtomb, and c32rtomb for converting between multibyte characters and these types. Between these and the existing mbstowcs family of functions, it's possible to translate between UTF-16, UTF-32, the platform multibyte character set, and the platform wide character set portably (though not necessarily losslessly unless the platform defined multibyte and wide character sets are UTFs; in particular, it seems like these functions will be fairly useless on Windows where the locale defined multibyte encoding is not allowed to use more than two bytes per character).

Furthermore, they added the `u8""` syntax for writing UTF-8 encoded string literals. Since UTF-8 strings can be stored in a plain `char *` or `std::string` and passed through most functions that deal in those types, this is one of the most useful new additions.

However, they seem to have failed to add any way to portably convert between UTF-8, UTF-16, and UTF-32. `mbrtoc16` and related functions convert between the implementation-defined multibyte encoding and UTF-16 or UTF-32, but you can't depend on that encoding being UTF-8. On Unix-like platforms it depends on the locale, and many of them use UTF-8 by default; even when it isn't the default, you can at least set a UTF-8 locale for the purposes of knowing that "multibyte" means UTF-8. On Windows, however, you explicitly can't use UTF-8, or any other encoding that requires more than two bytes per character, as the locale encoding.

Am I just missing something, or is the UTF-8 string type not accompanied by any way to convert it to the other types of strings: platform defined multibyte, platform defined wide char, UTF-16, or UTF-32? Is there no way to even tell if your system multibyte encoding is UTF-8? Is there any reason why this support wasn't included (specifically, I'm looking for actually written justification or discussion by the C or C++ standards committees, not just speculation)? Is there any work being done to improve this situation; is it likely to improve in the future?

Or, is the current best solution, if you want to support UTF-8 in a portable fashion, to write your own implementation, pull in a library dependency, or use platform-specific functions like iconv and MultiByteToWideChar?

Brian Campbell
  • Note that this question is fairly similar to [WChars, Encodings, Standards and Portability](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability), but I'm more interested in the reasoning behind the standard committees' decisions and whether there's progress towards standardizing this in a way that actually works than the "what is the best approach right now". – Brian Campbell Oct 29 '13 at 03:49
  • "`__STDC_ISO_10646__` can also be used as a proxy to determine whether wchar_t is capable of holding UTF-32 values;" Nope. For example OS X doesn't define this but uses UTF-32 for the `xx_XX.UTF-8` locales. The reason it doesn't define the macro is because for non .UTF-8 locales it may use non-UTF-32 encodings. If you want to know if a locale uses UTF-32 you'll just have to use information specific to the locale. If you want to know if you can use wchar_t for your own UTF-32 data and routines then you just need to ensure that `sizeof(wchar_t)*CHAR_BIT >= 21`. – bames53 Oct 29 '13 at 15:51
  • @bames53 Sigh, you're right. It looks like that can't even be relied upon. So what I'm wondering about is: why isn't this standardized? What barrier is there to writing a standard that will actually allow you to write actually portable code involving Unicode? Right now, the standard is pretty much useless; you need to write code for each platform separately. Is this something they have actually tried and failed, or is it just the case that no one is particularly interested in standardizing an actually useful set of Unicode routines in C? – Brian Campbell Oct 29 '13 at 16:34
  • I imagine that one of the big problems is that every platform already has some way of dealing with it and standardizing something would conflict with one or another existing implementation. The only way to avoid that is probably to standardize a Unicode library with no relation to the existing standard functionality. – bames53 Oct 29 '13 at 17:04
  • And that means, just as an example, a new version of `fopen()`. This is because to portably use UTF-8 it would have to be specified what encoding such library components take, which is not currently done. On some platforms fopen takes the system locale multibyte encoding, on some it takes an arbitrary byte sequence, on some it takes UTF-8 regardless of the locale. – bames53 Oct 29 '13 at 17:05
  • @bames53 I don't mean standardizing which encoding `fopen()` et al take. I mean standardizing conversion between UTF-8 and either UTF-16 or UTF-32, since we now have types for all of these. Furthermore, something to standardize converting between UTF-16 or 32 and `wchar_t`. The fact that you can only do certain transitions (multibyte to wide char and multibyte to UTF-16/32) in a portable manner is what gets me, and that you can't tell whether either the multibyte or wide character format is capable of representing all of Unicode. – Brian Campbell Oct 29 '13 at 18:29
  • Check out http://stackoverflow.com/questions/38688417/utf-conversion-functions-in-c11 – Brent Aug 05 '16 at 22:44

1 Answer

0

Sounds like you're looking for the std::codecvt type. See the example on that page for usage.

MikeP
  • Ah, that answers the question for C++11. I'm still curious about C11. I may update my question to just be about C11, though since Microsoft has explicitly declined to support anything beyond C89, asking questions about portability to Windows is probably futile. Interesting that, as you can see on the chart at the bottom of that page, for some of the conversions you can use `std::codecvt`, for some you have to use the C style conversion functions, and some of the conversions don't directly exist though you can compose them out of combinations of the others. – Brian Campbell Oct 29 '13 at 03:55
  • Also note that the standard codecvts only provide conversions between UTF-8 and UTF-16, or UCS-2, or UTF-32, not UTF-8 and the platform character sets. For that you need a two-pass conversion, with something like `c32rtomb`. – R. Martinho Fernandes Oct 29 '13 at 09:25
  • @R.MartinhoFernandes Yeah, that is somewhat odd (as I noted; some of them don't directly exist but can be composed). Do you know why some of the conversions were specified in the C style API, some in C++, and some were left out? It all seems very ad-hoc. – Brian Campbell Oct 29 '13 at 18:30
  • @BrianCampbell C specified the bits they cared about for themselves. C++ specified the bits they cared about for themselves. Then C++ also picked up what C had specified just by default. These are not intended to be used together. – bames53 Oct 29 '13 at 18:44
  • @BrianCampbell I don't know how the decision process went, but my impression from looking at the functionality provided, and how it was provided, was that the whole thing was done in a very very ad hoc manner, possibly just rushed. It seems there's a dearth of people involved with the committee that care and/or know enough about the subject matter, and that doesn't make for much progress in it. – R. Martinho Fernandes Oct 30 '13 at 09:28
  • @R.MartinhoFernandes Do you know how to get involved with the committee? Many standards bodies, like the IETF, W3C, WHATWG, Austin Group, etc do a lot of work on mailing lists that you can simply subscribe to. The ISO C committee seems to be more complicated; they appear to have lots of face to face meetings, and as an international organization it looks like you participate as members of your national standards body. Is it possible, as an individual who is interested in improving the situation, to make a proposal somewhere, that could actually be included in a future standard? – Brian Campbell Oct 30 '13 at 15:46
  • @Brian I don't know about C, but I can tell what I know about C++. There's a semi-open process. The committee members have their own private internal mailing lists, but their regular meetings are open to anyone. You can become a committee member through your national standards body, and AFAIK each of those has its own rules. Look up on yours. They also accept proposals openly, but in order to get a proposal accepted, it is important to have someone that can champion it in one of the face-to-face meetings. (cont) – R. Martinho Fernandes Oct 30 '13 at 15:54
  • (cont) I know that some committee members and some regular non-member attendees of the meetings are willing to champion other people's proposals, if you reach out to them and they are interested/able, of course. There's also an "official informal" forum for discussion of proposals at https://groups.google.com/a/isocpp.org/forum/#!forum/std-proposals. If you write up a proposal (official guidelines here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3370.html#Guidelines), you can ask for a champion there. Alternatively you can attend the next meeting yourself. (cont) – R. Martinho Fernandes Oct 30 '13 at 15:57
  • (cont) FWIW, I am aware of this recent proposal for more Unicode support in the C++ standard library http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3572.html because I significantly helped the person that wrote it. You may want to look at it first, and search for feedback on it on the proposals forum (searching for "Unicode" or "N3572" there should provide useful results). That's all I can think of right now :) Hope that helps. – R. Martinho Fernandes Oct 30 '13 at 16:02
  • Feel free to ping me in [the Lounge](http://chat.stackoverflow.com/rooms/10/loungec) if you still have questions; I'll be glad to help, be it with the process, or with the technical details. And sorry for the long comments. – R. Martinho Fernandes Oct 30 '13 at 16:03
  • @R.MartinhoFernandes Thanks! This has gone a bit beyond what the comment section is intended for, but I guess this really wound up being more of a "discussion" question than an answerable question. – Brian Campbell Oct 31 '13 at 05:30
  • How do you use `std::codecvt` to convert to or from a "narrow" encoding (which uses `char`) and UTF-8 (which also uses `char`)? – Raedwald Dec 11 '17 at 09:32