14

What is the fate of wchar_t in C++0x, considering the new character types char8_t, char16_t, and char32_t?

More importantly, what about std::wstring, std::wcout, etc?

Are the w* family classes deprecated?
Are there new std::ustring and std::Ustring classes for new character types?

deft_code
  • 1
    See http://stackoverflow.com/questions/872491/new-unicode-characters-in-c0x. It doesn't answer all your questions (i.e. deprecation), but I guess wchar_t isn't going to be deprecated. There's too much existing code already using it. – Boaz Yaniv May 13 '11 at 20:42
  • 2
    @Boaz Yaniv: Not to mention that deprecation usually doesn't mean anything. Implementors implement deprecated things because they need to compile old software, and nobody's going to rewrite old software just because of a deprecation warning. – David Thornley May 13 '11 at 20:56
  • No one is going to rewrite bad software over a deprecation warning but honestly find and replace isn't that big of a deal. We've already done away with NULL in favor of `nullptr` in all our code. – AJG85 May 13 '11 at 21:24
  • @AJG: The main problem with replacing wchar_t with char16_t (or whatever is applicable to your platform) as I see it is that many existing libraries are dependent on it. And although you can rather easily change your own code, you usually don't want to touch 3rd-party libraries, and at least some of the library writers would be wary of changing their libraries and breaking existing code. – Boaz Yaniv May 13 '11 at 21:31
  • 2
    @David: especially in C++. In 03, at any rate, deprecation is defined to mean "the feature may be removed in a future version of the standard". So conforming compilers *must* implement it. And it turns out that even non-deprecated features may be removed in future versions of the standard, since C++0x has some backward incompatibilities unrelated to things deprecated in C++03. So all deprecation really means is, "we're not sure we really wanted to put this in, but we did. kthxbye, the authors". – Steve Jessop May 13 '11 at 22:20
  • Why would you *ever* want to replace `wchar_t` with `char16_t`? With `wchar_t` you *might* be able to hold a Unicode character (it can on my machines, since `sizeof(wchar_t)` for me is always 4), whereas with char16_t, you are **guaranteed to be unable to hold a Unicode character.** Why in the world would you want to do such a daft thing??? – tchrist May 13 '11 at 22:41
  • 1
    Because the Windows API uses UTF-16. – dan04 May 14 '11 at 04:14
  • 4
    @tchrist: same reason you might use `int32_t` instead of `long` - because you prefer to code without the existential doubt and uncertainty of not knowing what range of values your type holds. Depending what the code does, removing possibilities might make it easier to reason about it, since all platforms will behave (closer to) the same. Also, unicode literals have type `char16_t[]` (for `u`) or `char32_t[]` (for `U`), not type `wchar_t[]` (which is `L`). I don't see the fascination with UTF-16, but some people (MS) seem to like it. – Steve Jessop May 14 '11 at 10:51
  • 1
    Microsoft was an early adopter of Unicode (UCS-2), back when it was assumed that 65,536 characters would be enough for everyone. When Unicode was expanded beyond the BMP, using UTF-16 instead of UTF-32 allowed more backwards compatibility. – dan04 May 14 '11 at 18:09
  • 1
    @dan04: sorry, I was being flippant. I do see the fascination: for almost all text it's half the size of UTF-32, and Windows is locked into it for legacy reasons. Furthermore, a lot of the difficulties of handling UTF-16 (variable length characters) are actually still present in UTF-32 due to combining marks. In fact the fundamental Unicode difficulties are harder, because canonical equivalencies are harder. So using 4 bytes per code point doesn't make it easier to e.g. reverse a string properly, just easier to claim you've done enough and that you won't support combining characters. – Steve Jessop May 15 '11 at 11:40
  • @Steve Jessop: Sure, but removing deprecated features usually doesn't affect the compilers much, since implementers usually keep them in for backward compatibility. About the only way to remove old features is to overwrite them: no C++0x-compliant compiler can have the old meaning for `auto`, for example. – David Thornley May 16 '11 at 13:53
  • @David, deprecated features are _normative_ (see annex D of either standard). That means compiler writers are required to implement them. Here is a real-world example that illustrates this. The EDG compiler is the only one to support exported templates. The committee wanted to deprecate them. EDG asked that they just be removed instead, so that it would not have to continue supporting them. As EDG was the only compiler with a working implementation, that is what the committee did: exported templates are not deprecated in C++0x, they just are not there. – deft_code May 16 '11 at 14:49
  • @deft_code: Sure, but if a few compilers had supported exported templates, and they had customers that used the feature, would they remove it because it wasn't in C++0x? Deprecation should be a signal (not completely accurate) not to use that feature, because it might not be in future standards. (As you say, a conforming implementation must implement them.) Programmers will normally still use deprecated features, software written with them will hang around unchanged, and a compiler vendor will feel compelled to support them even when removed from the standard. – David Thornley May 16 '11 at 16:09

1 Answer

8

Nothing happens to wchar_t; it is still implementation-specific (and compatible with C).

The new types char16_t and char32_t have defined semantics in the new standard. The old wchar_t might have the same size and encoding as one of them, but which one is likely to differ between implementations, and on some systems it matches neither.

You will have typedefs u16string and u32string for strings of the new character types, but no new standard streams.

Bo Persson
  • Can you confirm that std::string should contain utf8 chars? Or is there another type for this? u8string? – Klaim May 13 '11 at 23:42
  • 3
    There is no `u8string`. `char` has an overloaded meaning of "UTF-8 code unit", "member of the basic execution character set", or "byte". – dan04 May 14 '11 at 04:19
  • 2
    @Klaim - Like Dan says, std::string could (not should) contain UTF-8. It is up to the application to decide the interpretation. The language already has three narrow character types, and the committee was hesitant to add a fourth! – Bo Persson May 14 '11 at 06:05