
Reading C++17 draft §6.9.1/5:

Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.

Now referring to C11 draft §7.20.1.2/2, which C++ inherits as the normative reference for the C library types:

The typedef name uint_leastN_t designates an unsigned integer type with a width of at least N, such that no unsigned integer type with lesser size has at least the specified width. Thus, uint_least16_t denotes an unsigned integer type with a width of at least 16 bits.

Note the "at least" part. This means that char16_t may actually have e.g. 32 bits, making an array of char16_t a bad representation of a UTF-16 raw data. In this case writing such an array to a binary file would result in valid code units alternating with U+0000 characters.

Is there a good reason for char16_t to be defined in terms of uint_least16_t instead of uint16_t? Or is it simply a defect in the standard?

Ruslan
  • @PanagiotisKanavos that has nothing to do with the size of `char16_t` – Caleth Jun 21 '18 at 10:07
  • [Here's an example](https://stackoverflow.com/a/6972551/597607) where `uint_least16_t` would use 18 bits. – Bo Persson Jun 21 '18 at 10:09
  • @Caleth it has to do with whether `char16_t` can fulfill its job of holding a single UTF16 character. – Panagiotis Kanavos Jun 21 '18 at 10:09
  • Quality of implementation. Implementing all the functions for 16-bit characters (and selecting an encoding) is a huge amount of work (if done correctly and fast). So C++ gives many compilers the possibility of implementing only char32_t. Most compilers will not care about 16- and 32-bit chars (these days), so the standard simplifies the implementation for simpler compilers (so again, it is about "quality of implementation") – Giacomo Catenazzi Jun 21 '18 at 10:11
  • @Giacomo: some platforms may not have 16-bit integers... – rubenvb Jun 21 '18 at 10:14
  • @rubenvb: right, but such platforms (and platforms which didn't have good support for C) are obsolete (and IIRC the 48-bit one was also not compatible with C, because of its standard character encoding; C requires a-z to be in one block). – Giacomo Catenazzi Jun 21 '18 at 10:16
  • @GiacomoCatenazzi `C requires a-z to be in one block` are you sure? Can you tell me the section of that rule? I was under the impression that wasn't required. – eerorika Jun 21 '18 at 10:29
  • @GiacomoCatenazzi see Bo Persson's comment above. And frankly, current obsolescence doesn't matter for the C or C++ standards. Note e.g. the trouble they go through to specify two's complement behaviour without mandating actual two's complement implementation. – rubenvb Jun 21 '18 at 10:30
  • @PanagiotisKanavos `char16_t` doesn't have to hold a UTF-16 _character_. It only must hold a UTF-16 _code unit_, which is always exactly 16 bits. – Ruslan Jun 21 '18 at 10:31
  • @user2079303: right. I was confusing things. POSIX has such a requirement, and C only requires that 0 to 9 are consecutive. – Giacomo Catenazzi Jun 21 '18 at 10:35
  • @rubenvb: are you sure? C was not created to be universal to all CPUs; it made some strong design choices. The Z80 cannot execute C in an optimal manner, and people say that was one of the reasons for its decline. C has the "as if" clause, so it can run on all platforms (but by, say, simulating 32-bit arithmetic on a 48-bit machine). At ISO, the health of actual implementations and architectures is required to be checked (and reported) regularly. Without a working C compiler implementing the latest standard, the support for 18 bits is just a convention, as long as it does not disturb the other parts of the standard. – Giacomo Catenazzi Jun 21 '18 at 10:43
  • @GiacomoCatenazzi disagree with your "C is not created to be universal to all CPUs". If the opposite were true as you are implying, it would specify a whole lot more detail in much simpler wording than it does now. The fact that even the size of a byte/`char` isn't specified completely invalidates that statement. – rubenvb Jun 21 '18 at 10:52
  • @PanagiotisKanavos there is no such thing as a UTF-16 character. There are Unicode characters, and then there are UTF-16 code units. – n. m. could be an AI Jun 21 '18 at 10:54
  • What part of "no unsigned integer type with lesser size has at least the specified width" confuses you? If char16_t actually has 32 bits, well, you don't have anything better to play with. – n. m. could be an AI Jun 21 '18 at 10:57
  • @rubenvb: You look too much at the encoding (bits and bytes). In any case: when C was specified, there were still 6-bit and 7-bit machines. C requires a stack, floating-point support (32- and 64-bit, in a well-defined manner), pointers, pointer arithmetic, and function pointers (also not always common; try it with an 8-bit CPU). Do not forget that the standard library was not implementable on many OSes. C does not use (and makes it difficult for you to use): BCD, vector or record calculations (common in past CPUs), data whose size depends on bits within the data itself, etc. C obsoleted a lot of architectures. – Giacomo Catenazzi Jun 21 '18 at 11:53
  • @GiacomoCatenazzi C was designed to support a wide range of CPU types; that's why it still supports 3 representations of signed types and can run on any architecture with CHAR_BIT >= 8. The points you make are mostly nonsense. Hardware floating-point support wasn't widely available at the time. Nothing prevents an 8-bit MCU from doing pointer arithmetic, and you can have data pointers and function pointers with different sizes, which is common in Harvard architectures – phuclv Jun 21 '18 at 12:24
  • @phuclv "wide range" != all CPUs. By design C was not made to support all CPUs. As I said, the C standard has the "as if" clause, but that does not mean all CPUs are supported in the same manner. Implement a C compiler for the Z80 and you'll see what I mean; implement the standard library on many old OSes and you'll see too. On floating point you miss the point: C supports the **4** signed integer representations (sign-magnitude, one's complement, two's complement, translation of zero), but it requires a fixed pattern for floating-point numbers. Also, a 4-bit machine (if Turing complete) can execute any C program, but C is not designed for it. – Giacomo Catenazzi Jun 21 '18 at 13:28
  • @GiacomoCatenazzi [C++ doesn't require a fixed format for floating-point types](https://stackoverflow.com/q/2234468/995714), and [it doesn't require](https://stackoverflow.com/questions/79923/what-and-where-are-the-stack-and-heap#comment1197881_79936) [a stack either](https://stackoverflow.com/questions/48279268/why-did-c-never-implement-stack-extension#comment83542119_48279268) [Are there stackless or heapless implementation of C++?](https://stackoverflow.com/q/10900885/995714) – phuclv Jun 24 '18 at 08:35
  • Besides, I don't know about Z80 but [it seems that the problem is due to bad compilers at the time](https://retrocomputing.stackexchange.com/q/6095/1981), and not specifically to C – phuclv Jun 24 '18 at 08:57

2 Answers


First, as the name suggests, uint_least16_t is required to be the smallest unsigned type that is at least 16 bits wide. On a system with both 16- and 32-bit integers, it cannot be 32 bits.

Second, uint16_t is not required to exist. It's only there on systems that have an unsigned integer type that is exactly 16 bits wide with no padding bits. Granted, those are quite common, but C and C++ are designed to impose minimal constraints on the hardware that they can target, and there are systems that don't have a 16-bit integer type.

On a system that does have a 16-bit integer type, uint16_t will be 16 bits wide (duh...), and uint_least16_t will also be 16 bits wide. On a system that doesn't have a 16-bit integer type, uint16_t won't exist, and uint_least16_t will. For code that needs to store values in a range that's representable in 16 bits, using uint_least16_t is portable. For code that needs to store exactly 16 bits (which is rare), uint16_t is the way to go.
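To illustrate that last distinction, here is a minimal sketch (the alias name u16 is just illustrative): the exact-width type can be detected at compile time, because <cstdint> defines the UINT16_MAX macro only when uint16_t itself is provided.

```cpp
#include <cstdint>

// UINT16_MAX is defined only when the implementation provides uint16_t,
// so its presence can be used to select the exact-width type.
#ifdef UINT16_MAX
using u16 = std::uint16_t;        // exactly 16 bits, no padding
#else
using u16 = std::uint_least16_t;  // always available, possibly wider than 16 bits
#endif
```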

Pete Becker

This makes it possible to use char16_t on systems where the byte size is not a factor of 16 (for example a 32-bit byte, or a 9-bit byte). Such systems can have uint_least16_t but not uint16_t.
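As a small sketch of what is and is not guaranteed here, the actual width of char16_t can be inspected with std::numeric_limits (its digits member is the number of value bits, since char16_t is unsigned):

```cpp
#include <limits>

// Guaranteed by the standard: char16_t is unsigned and at least 16 bits wide.
static_assert(!std::numeric_limits<char16_t>::is_signed,
              "char16_t is an unsigned type");
static_assert(std::numeric_limits<char16_t>::digits >= 16,
              "char16_t can hold any UTF-16 code unit");

// NOT guaranteed, and would fail on e.g. a 32-bit-byte platform:
// static_assert(std::numeric_limits<char16_t>::digits == 16, "exactly 16 bits");
```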

eerorika
  • @PanagiotisKanavos However, a UTF-16 code unit is always 16 bits, which is more relevant since a `char16_t` stores one code unit. It's not machine specific either. – eerorika Jun 21 '18 at 10:07
  • [No it's not](https://en.wikipedia.org/wiki/UTF-16). It can be 16 or 32 bits. The 16-bit only subset was UCS-2. Windows *used* to support UCS-2 only, but works with full UTF16 nowadays – Panagiotis Kanavos Jun 21 '18 at 10:08
  • @PanagiotisKanavos You seem to misunderstand the distinction between an encoding's "code unit" and unicode "code points". From the article you link (emphasis mine): "In UTF-16, **code points** greater or equal to 2^16 are encoded using two 16-bit **code units**" – rubenvb Jun 21 '18 at 10:12
  • @rubenvb we were talking about bits, not code points or code units. – Panagiotis Kanavos Jun 21 '18 at 10:13
  • @PanagiotisKanavos No, we're not (at least user2079303 and I are not). `char16_t` is meant to store **code units**, not *characters*. So the C++ standard specifies the loosest definition of it: it must be able to hold at least 16 bits. Which is sufficient. – rubenvb Jun 21 '18 at 10:15
  • @rubenvb thanks for reminding me how easy it is to use Unicode in other languages – Panagiotis Kanavos Jun 21 '18 at 10:16
  • `char16_t` has a fixed size, because it is **a C++ type**. `u""` is a `char16_t[3]` literal. `u''` is *ill-formed* – Caleth Jun 21 '18 at 10:18
  • @PanagiotisKanavos If you mean other programming languages, the specification of `char16_t` has no influence on how easy it is to use unicode in other languages. Their standard libraries might provide more tools to handle it, but the programmer still needs to be aware of the differences between code points, characters, and graphemes. – rubenvb Jun 21 '18 at 10:19
  • Moreover, `char16_t` is defined in terms of `uint_least16_t`, which implies it can only store a UTF-16 code unit; otherwise you'd need a `uint_least21_t` for a code point – phuclv Jun 21 '18 at 10:20
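To make the code unit / code point distinction in these comments concrete, here is a small sketch: a code point outside the BMP, such as U+1F600, occupies two char16_t code units (a surrogate pair), which is why a string literal containing a single such code point is a char16_t[3] (two code units plus the null terminator):

```cpp
#include <cassert>

int main()
{
    // U+1F600 lies above U+FFFF, so UTF-16 encodes it as two 16-bit code units.
    const char16_t emoji[] = u"\U0001F600";
    static_assert(sizeof(emoji) / sizeof(emoji[0]) == 3,
                  "two code units plus the null terminator");
    assert(emoji[0] == 0xD83D); // high surrogate
    assert(emoji[1] == 0xDE00); // low surrogate
}
```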