12

NB: I'm sure someone will call this subjective, but I reckon it's fairly tangible.

C++11 gives us new basic_string types std::u16string and std::u32string, type aliases for std::basic_string<char16_t> and std::basic_string<char32_t>, respectively.

The use of the substrings "u16" and "u32" in this context rather implies, to me, "UTF-16" and "UTF-32", which would be silly since C++ of course has no concept of text encodings.

The names in fact reflect the character types char16_t and char32_t, but these seem misnamed. They are unsigned, due to the unsignedness of their underlying types:

[C++11: 3.9.1/5]: [..] Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively [..]

But then it seems to me that these names violate the convention that such unsigned types have names beginning with 'u', and that the use of numbers like 16, unqualified by terms like least, indicates fixed-width types.

My question, then, is this: am I imagining things, or are these names fundamentally flawed?

Lightness Races in Orbit

3 Answers

16

The naming convention to which you refer (uint32_t, int_fast32_t, etc.) is actually only used for typedefs, and not for primitive types. The primitive integer types are {signed, unsigned} × {char, short, int, long, long long} (as opposed to the floating-point types).

However, in addition to those integer types, there are four distinct, unique, fundamental types, char, wchar_t, char16_t and char32_t, which are the types of the respective literals '', L'', u'' and U'' and are used for alphanumeric data, and similarly for arrays of those. Those types are of course also integer types, and thus they have the same layout as some of the arithmetic integer types, but the language makes a very clear distinction between the former, arithmetic types (which you would use for computations) and the latter "character" types, which form the basic unit of some type of I/O data.

(I've previously rambled about those new types here and here.)

So, I think that char16_t and char32_t are actually very aptly named to reflect the fact that they belong to the "char" family of integer types.

Kerrek SB
  • 1
    This interpretation that `char16_t` and `char32_t` belong to a different "family" of types quite neatly works around what I still think would otherwise be a somewhat misleading inconsistency in the standard. That'll do! Thanks – Lightness Races in Orbit Oct 09 '12 at 12:26
  • I thought `char16_t` and `char32_t` were typedefs in C, and C++ made them primitive types to allow for overloading. – Jesse Good Oct 09 '12 at 21:40
  • @JesseGood: Why would you think that? Check C11 7.28. – Kerrek SB Oct 09 '12 at 21:44
  • I'm not up to date with C, but I was reading the original proposal [here](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html) which said: `N1040 defined char16_t and char32_t as typedefs to uint_least16_t and uint_least32_t, which make overloading on these characters impossible`, perhaps something changed after that before the C11 standard. – Jesse Good Oct 09 '12 at 22:03
  • The wording in 7.28 is vague, but after some more searching, I believe they are typedefs in C11, see [here](http://en.cppreference.com/w/c/string/multibyte) which lists them as typedefs in ``. – Jesse Good Oct 09 '12 at 22:13
  • @JesseGood: Actually, you're right: It says, "the header declares types", which means they're typedefs. Well spotted. – Kerrek SB Oct 09 '12 at 22:16
  • @JesseGood: Well, that gives me a C question to ask then ;) – Lightness Races in Orbit Oct 11 '12 at 01:59
  • I checked and in gcc 4.7, `char16_t` and `char32_t` are keywords. This answer seems to imply that they are not *primitive types*, but I believe that they are. They seem to be the first primitive types that use the xxx_t naming convention. Am I mistaken? If they are not primitive types, what would you call them, and why are they not primitive types? – Colin D Bennett Nov 07 '13 at 23:31
  • @KerrekSB: you say "The primitive integral types are {signed, unsigned} {char, short, int, long, long long}" and then "addition to those integral types, there are four distinct, unique, fundamental types, char, wchar_t, char16_t and char32_t" ... but my question is: isn't char32_t a *primitive type*? If it's not, then what is the difference between a *primitive type* and a *fundamental type*? – Colin D Bennett Nov 08 '13 at 19:08
  • 1
    @ColinDBennett: "primitive" is a colloquialism, not in the standard. The standard only defines "fundamental" types. – Kerrek SB Nov 08 '13 at 19:14
  • @KerrekSB: OK, thanks. That makes sense. So there are (1) fundamental types, (2) user-defined types, and (3) typedefs which are alternative names for the *same* type (typedefs do not create new types). – Colin D Bennett Nov 08 '13 at 19:29
  • 1
    @ColinDBennett: There's a difference between *types* and type *names*. Object types are only either fundamental, compound or user-defined, but type *names* can be built-in, or the name of a user-defined type (something declared with a class key), or a declared type alias (something declared with `using` or `typedef`). To my knowledge there are no "name traits" which tell you if a name is the original name of something or an alias. – Kerrek SB Nov 08 '13 at 20:27
4

are these names fundamentally flawed?

(I think most of this question has been answered in the comments, but to make an answer:) No, not at all. char16_t and char32_t were created for a specific purpose: to provide data-type support for all the Unicode encoding forms (UTF-8 being covered by char) while keeping them generic enough not to be limited to Unicode alone. Whether they are unsigned or fixed-width is not directly related to what they are: character data types, which hold and represent characters. Signedness is a property of types that represent numbers, not characters. The types are meant to store 16-bit or 32-bit character data, nothing more, nothing less.

Jesse Good
  • And the only thing the width really affects, behavior-wise, is overflow modulo 2ᴺ. But addition and subtraction on character codeunits only have meaning within certain ranges of related consecutive characters (e.g. the Arabic numerals which are guaranteed to be consecutive). So the overflow behavior really is not important. – Ben Voigt Jan 25 '13 at 21:55
-3

They are not fundamentally flawed, by definition - they are part of the standard. If that offends your sensibilities then you must find a way to deal with it. The time to make this argument was before the latest standard was ratified, and that time has long passed.

Mark Ransom
  • 2
    *"I know not whether ISO/IEC be right, nor whether ISO/IEC be wrong. But all we know whose compiles fail is that the types are strong. (And that a char16_t is like a char32_t...and that the char32_t is unsigned long.)"* – HostileFork says dont trust SE Oct 09 '12 at 02:45
  • 13
    The first sentence of this answer implies that there is nothing flawed in the C++ language standard. That is a bold claim with which I am not sure many informed people would agree. – James McNellis Oct 09 '12 at 03:06
  • "The time to make this argument was before the latest standard was ratified, and that time has long passed." Not really - it's not like C++11 will be the last. Are you saying that all arguments regarding flaws in C++03 should have been silenced at the time? – Lightness Races in Orbit Oct 09 '12 at 09:16
  • @JamesMcNellis, the question made a bold claim which called for a bold response. And regardless of what you think of it, it *is* the standard and this part of it is not going to change. – Mark Ransom Oct 09 '12 at 12:00
  • @MarkRansom: It was not a claim; it was a question. Indeed things in the standard can be flawed in the scope of language design, even though they are "correct" by definition in the scope of the language itself. "Find a way to deal with it" is hardly a technical interpretation. – Lightness Races in Orbit Oct 09 '12 at 12:26
  • 1
    @LightnessRacesinOrbit, that was part of my point - a technical interpretation is hardly called for here because it's *just a naming convention*. The types would be what they are no matter what they're called. It wouldn't rise to the level of a "fundamental flaw" unless it produced a logical contradiction in the standard. – Mark Ransom Oct 09 '12 at 13:19