71

In the cppreference documentation, I noticed for `char`:

> The character types are large enough to represent any UTF-8 eight-bit code unit (since C++14)

and for `char8_t`:

> type for UTF-8 character representation, required to be large enough to represent any UTF-8 code unit (8 bits)

Does that mean both are the same type? Or does `char8_t` have some other feature?

Pavan Chandaka
  • 11
    Well, it's clear from looking that `char8_t` is an 8-bit type. Also, *The signedness of char depends on the compiler and the target platform: the defaults for ARM and PowerPC are typically unsigned, the defaults for x86 and x64 are typically signed.* while `char8_t` is **always** unsigned. – Elliott Frisch Aug 07 '19 at 21:28
  • " or does char8_t has an extra edge?" - what do you mean by that? –  Aug 07 '19 at 21:29
  • I mean any other benefits – Pavan Chandaka Aug 07 '19 at 21:44
  • Rats. I was hoping you meant like the magic sword from The Sword and the Sorcerer. – user4581301 Aug 07 '19 at 21:47
  • 1
    Logically, code can assume that a string of `char8_t` always contains UTF-8 text (barring bugs), whereas it is less safe to assume any particular encoding of a `char` string without additional knowledge of the environment. – Miral Aug 08 '19 at 07:31
  • 8
    Well, there _are_ benefits. The `char` type, like much of C++'s C heritage, is and has always been annoyingly broken. You do not know whether it's signed or not, and strictly speaking you do not even know how many bits it has (though 8 is a rather safe bet, there's no guarantee whatsoever). The `char8_t` type gives both guarantees. Unluckily, nobody was bold enough to simply "fix" the broken original type (which could admittedly break existing code, but so what... modern C++ is incompatible with legacy C++ anyway). Much like nobody could be bothered to make `size_t` or `ptrdiff_t` a _proper_ type. – Damon Aug 08 '19 at 08:48
  • 5
    @Damon according to [this comment](https://stackoverflow.com/questions/57402464/is-c20-char8-t-the-same-as-our-old-char/57402487?noredirect=1#comment101288657_57402487), there is no requirement that `char8_t` is exactly eight bits, so nothing changed in that regard… – Holger Aug 08 '19 at 09:04
  • @Holger: Funnily, the C++ standard **indeed** doesn't require that there be exactly 8 bits. Nor does it require that for any of the stuff; it merely says "Yeah, blah blah, same as in C". Now, C doesn't say either... it says "Yeah, blah blah POSIX". Luckily, POSIX _does_ say :-) This is an _"exact width type"_ in POSIX talk (as opposed to the `_least` or `_fast` types, which are at least as large and could be, well, basically anything). – Damon Aug 08 '19 at 17:43
  • 5
    @Damon C has always guaranteed that `char` has *at least* 8 bits. POSIX and most other systems like Windows guarantee that `char` is exactly 8 bits. But C does **not** say "Yeah, blah blah POSIX". POSIX incorporates the C standard, not the other way around. And unless C suddenly decides to alienate a huge part of its niche, they're not going to make an exactly eight bit type mandatory, because C is the primary language used to program all the modern embedded/niche hardware which has bytes bigger than eight bits. – mtraceur Apr 23 '20 at 06:34

2 Answers

100

Disclaimer: I'm the author of the `char8_t` P0482 and P1423 proposals.

In C++20, `char8_t` is a distinct type from all other types. In the related proposal for C, N2653, `char8_t` is a typedef of `unsigned char`, similar to the existing typedefs for `char16_t` and `char32_t`.

In C++20, `char8_t` has an underlying representation that matches `unsigned char`. It therefore has the same size (at least 8 bits, but possibly larger), alignment, and integer conversion rank as `unsigned char`, but different aliasing rules.
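
These guarantees can be spot-checked at compile time; a minimal sketch, assuming a C++20 compiler:

#include <type_traits>
// Same size, alignment, and signedness as unsigned char...
static_assert(sizeof(char8_t) == sizeof(unsigned char));
static_assert(alignof(char8_t) == alignof(unsigned char));
static_assert(std::is_unsigned_v<char8_t>);
// ...but a distinct type rather than a typedef.
static_assert(!std::is_same_v<char8_t, unsigned char>);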

In particular, `char8_t` was not added to the list of types at [basic.lval]p11, [basic.life]p6.4, [basic.types]p2, or [basic.types]p4. This means that, unlike `unsigned char`, it cannot be used for the underlying storage of objects of another type, nor can it be used to examine the underlying representation of objects of other types; in other words, it cannot be used to alias other types. A consequence of this is that objects of type `char8_t` can be accessed via pointers to `char` or `unsigned char`, but pointers to `char8_t` cannot be used to access `char` or `unsigned char` data. In other words:

reinterpret_cast<const char   *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text");   // Undefined behavior.
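
The same aliasing rule governs examining object representations. A minimal sketch of the difference, assuming C++20 (forming the `char8_t` pointer is fine; using it to read the `int` is not):

#include <cstdio>
int main() {
    int n = 42;
    // Ok: unsigned char may be used to inspect any object representation.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&n);
    std::printf("%02x\n", static_cast<unsigned>(p[0]));
    // Reading n's bytes through this pointer would be undefined behavior,
    // because char8_t is not one of the aliasing types.
    const char8_t* q = reinterpret_cast<const char8_t*>(&n);
    (void)q;
}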

The motivation for a distinct type with these properties is:

  1. To provide a distinct type for UTF-8 character data vs character data with an encoding that is either locale dependent or that requires separate specification.

  2. To enable overloading for ordinary string literals vs UTF-8 string literals (since they may have different encodings); see the sketch after this list.

  3. To ensure an unsigned type for UTF-8 data (whether `char` is signed or unsigned is implementation-defined).

  4. To enable better performance via a non-aliasing type; optimizers can better optimize types that do not alias other types.
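
As a minimal sketch of point 2 (the `print` overload set here is hypothetical, not a standard API), the distinct type lets the compiler tell the two kinds of literal apart:

#include <iostream>
void print(const char*)    { std::cout << "ordinary (locale-dependent) encoding\n"; }
void print(const char8_t*) { std::cout << "UTF-8\n"; }
int main() {
    print("text");   // selects the const char* overload
    print(u8"text"); // selects the const char8_t* overload (C++20)
}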

Tom Honermann
  • 14
    Why is it `char8_t` and not `uchar8_t`? – Mala Aug 13 '19 at 20:46
  • 29
    Because `char8_t` is consistent with `char16_t` and `char32_t` (also unsigned types). – Tom Honermann Aug 14 '19 at 01:26
  • 1
    @TomHonermann of the reasons listed there, IMHO only 2 and perhaps 4 make sense. 1 can be achieved with the regular char type, whereas 3 is irrelevant unless you're doing arithmetic operations on your chars. – Martin Oct 31 '22 at 12:35
  • 1
    @Martin, Yes, plain `char` can certainly be used to manipulate UTF-8 data, but there is ample evidence that programmers struggle with maintaining a correct association of a character encoding with data stored in `char`; mojibake is all around us still today. `char8_t` certainly doesn't solve all such problems, but it does provide some guard rails. As for 3, the number of bits in `char`, as well as whether it is signed or unsigned, impacts checking for leading vs trailing code units. `c >= 0x80` is not a portable way to check for a trailing code unit value. – Tom Honermann Nov 02 '22 at 17:33
  • @TomHonermann thanks for the reply. I hadn't thought of that kind of check; from that perspective what you say in 3 makes sense. I guess the idea is to support implementations where CHAR_BIT != 8? – Martin Nov 03 '22 at 11:59
  • 1
    @Martin, though implementations with `CHAR_BIT` other than 8 do exist and the standard must remain internally consistent for all theoretical implementations, such implementations were not a motivation for `char8_t`. For the trailing code unit check example, note that `c >= 0x80` (where `c` has type `char`) is always false for an implementation with an 8-bit signed `char` type. `c < 0` could be used instead for such implementations, but that is always false for implementations with an unsigned `char` type. – Tom Honermann Nov 03 '22 at 17:11
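A minimal sketch of the portable check discussed in the comments above (the helper name is hypothetical):

constexpr bool is_trailing_code_unit(char8_t c) {
    // Portable: char8_t is always unsigned, and trailing UTF-8 code
    // units have the bit pattern 0b10xxxxxx (0x80 through 0xBF).
    return (c & 0xC0) == 0x80;
}
// With plain char, `c >= 0x80` is always false for a signed 8-bit char,
// and `c < 0` is always false for an unsigned char.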
  • @Tom Honermann Sorry for being late: what is the recommended way to handle the `char8_t` and `char` conversion right now? boost::nowide does not yet understand `char8_t`, nor (AFAIK) does boost::locale. – schorsch_76 Nov 07 '22 at 08:02
  • @schorsch_76, the C and C++ standards currently lack interfaces for converting between the `char` (locale-based) encoding and UTF-8. Work on such interfaces is underway. Low-level conversion interfaces for C are being pursued by WG14 via [N3031](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3031.htm). Once available, higher-level interfaces like those specified in [P1629](https://wg21.link/p1629) will hopefully be adopted by WG21 for C++. In the meantime, utilities like `iconv` or other conversion libraries are required. – Tom Honermann Nov 08 '22 at 05:00
  • @TomHonermann I think with that proposal, you made the language even less usable, for negligible theoretical performance gain. I have just started using C++/20, and to avoid compiler errors I have to cast these pointers between `char8_t` and `char` all the time. A trivial example: to convert UTF-8 to UTF-16, I call the `MultiByteToWideChar` function. That API is almost 30 years old by now (available since WinNT 3.5), and it will never change because doing so would break builds of pretty much every piece of Windows software in the world. – Soonts Jan 05 '23 at 11:27
  • @Soonts, performance improvements were not a primary motivation for `char8_t`. It sounds like you are passing UTF-8 string literals to `MultiByteToWideChar()`. Why not call a function that converts specifically from `char8_t` to `wchar_t`? Doing so will give you type and encoding safety without the need for casts at the call sites (if the function is implemented with `MultiByteToWideChar()`, a cast would be needed internally). An encoding specific function might perform better as well. – Tom Honermann Jan 05 '23 at 20:36
  • @TomHonermann Here’s an example for Linux: https://www.man7.org/linux/man-pages/man3/printf.3.html Similar to `MultiByteToWideChar` on Windows, that API is decades old, and will never change for backward compatibility reasons. Even if you miraculously convince Linux developers to expose another version for these API for new UTF8 strings, that alone won’t be enough because there’re quite a few similar APIs maintained by other teams, like libsystemd https://www.man7.org/linux/man-pages/man3/sd_notify.3.html and many other OS components. – Soonts Jan 07 '23 at 15:22
  • @TomHonermann I don't expect that even 20 years in the future people will update all these legacy APIs to support that new UTF8 string type. Instead, people will cast at call sites, become frustrated with the usability of the C++ language, and, whenever they can, pick other languages instead. Easy interop with C is one of the two reasons why I still regularly pick C++ for new projects (the other one is SIMD intrinsics). And in C++/20 you made it less usable for no good reason. – Soonts Jan 07 '23 at 15:24
  • @Soonts, even before the introduction of `char8_t`, passing a UTF-8 string literal to one of the `printf()` family of functions, as either the format string or as an argument, results in undefined behavior if the program is run in a non-UTF-8 locale (as is the typical case on Windows). See my answer to another question at https://stackoverflow.com/a/58895428/11634221. – Tom Honermann Jan 08 '23 at 04:54
  • @Soonts, I don't expect to see new `char8_t`-based interfaces widely deployed. I also don't expect to see the entire C++ ecosystem switch to UTF-8 for all locales in the near future; not for Windows and not for the EBCDIC-based systems that will continue to be used for a long time to come. `char8_t` is most useful as the internal encoding of a program; it can help to prevent inadvertent mojibake when producing or consuming text at the program boundary. I agree some people will just add casts, but that is not what I would advise. – Tom Honermann Jan 08 '23 at 05:07
58

`char8_t` is not the same as `char`. It behaves exactly the same as `unsigned char`, though, per [basic.fundamental]/9:

> Type `char8_t` denotes a **distinct** type whose underlying type is `unsigned char`. Types `char16_t` and `char32_t` denote distinct types whose underlying types are `uint_least16_t` and `uint_least32_t`, respectively, in `<cstdint>`.

(emphasis mine)


Do note that since the standard calls it a distinct type, code like

std::cout << std::is_same_v<unsigned char, char8_t>;

will print `0` (`false`), even though `char8_t` is implemented as an `unsigned char`. This is because it is not an alias, but a distinct type.


Another thing to note is that `char` can be implemented as either a `signed char` or an `unsigned char`. That means it is possible for `char` to have the same range and representation as `char8_t`, but they are still separate types. `char`, `signed char`, `unsigned char`, and `char8_t` are all the same size, but they are all distinct types.
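
These distinctions can be verified at compile time; a quick sketch, assuming a C++20 compiler:

#include <type_traits>
// char may share range and representation with char8_t (when char is
// unsigned), but all four of these types are guaranteed distinct:
static_assert(!std::is_same_v<char, signed char>);
static_assert(!std::is_same_v<char, unsigned char>);
static_assert(!std::is_same_v<char8_t, char>);
static_assert(!std::is_same_v<char8_t, unsigned char>);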

NathanOliver
  • 10
    @MichaelDorgan But 98 is bigger than 17 and 98 was... not so fun to work with ;) – NathanOliver Aug 07 '19 at 21:46
  • 1
    You might want to mention that `char`, unlike `char8_t`, may be either signed or unsigned. It's possible for `char` and `char8_t` to have the same range and representation (with both having the same *underlying type*, `unsigned char`), but they're still distinct types. – Keith Thompson Aug 07 '19 at 21:51
  • @NathanOliver Better. – Barry Aug 07 '19 at 21:51
  • 3
    @MichaelDorgan: Isn't it the "compatibility" with C which increases complexity, such as the sign issue of `char`? – Jarod42 Aug 07 '19 at 21:53
  • @KeithThompson I've added a paragraph about that. – NathanOliver Aug 07 '19 at 21:54
  • 11
    @MichaelDorgan in case you are unaware, C also has `char16_t`, `char32_t` and associated char/string literals and manipulation functions. (As well as `char`, `unsigned char`, `signed char`, `int8_t` and `uint8_t` of course) – M.M Aug 07 '19 at 22:02
  • 4
    For some definition of "exactly the same". A key feature of `char8_t` is that it doesn't alias everything under the sun. – T.C. Aug 07 '19 at 22:16
  • 7
    So, did we actually need another name for something that already exists? – Michael Chourdakis Aug 07 '19 at 22:18
  • What happens when we exceed `size_t` number of integer types? – Paul Sanders Aug 07 '19 at 22:23
  • @MichaelChourdakis I believe it was done for consistency's sake. It makes it a lot easier to write a macro to get `charN_t` when `8` is a valid `N`. – NathanOliver Aug 07 '19 at 22:31
  • `using utf8 = char8_t;` Naming aside, a char8_t isn't a character, it is a UTF-8 encoding unit. Probably a quibble. I'm just happy seeing C++ become more and more Unicode savvy without resorting to third party utilities (even excellent ones like ICU). – Eljay Aug 07 '19 at 22:37
  • @M.M - Yup, aware. I kinda have to be, as it's my job to make sure all those things continue to work. But when I'm in system/OS land and having to deal with C++ edge cases again and again... Anyway, so yeah, and get off my lawn :) – Michael Dorgan Aug 07 '19 at 22:41
  • 22
    @MichaelChourdakis: "*So, did we actually need another name from something that already exists?*" Yes. If I give you a `const char*`, is it UTF-8 encoded? You don't know. If I instead give you a `const char8_t *`, then if it *isn't* UTF-8 encoded, *I am a liar*. Types matter, and if C++ is going to get decent Unicode support, we must have types that represent strings encoded in a Unicode encoding, not merely whatever the compiler felt like. The only real problem with `char8_t` is that few existing APIs that *could* take them do so. And that's a problem that will be solved as Unicode gets done. – Nicol Bolas Aug 07 '19 at 22:57
  • 13
    Interestingly, there's no requirement that `char8_t` is exactly 8 bits. Since it has the same representation as `unsigned char`, it's `CHAR_BIT` bits. Unlike `uint8_t`, which isn't defined if there's no 8-bit integral type, `char8_t` is always defined. (There are probably no hosted implementations with `CHAR_BIT != 8`.) – Keith Thompson Aug 08 '19 at 00:01
  • 1
    @KeithThompson that makes me wonder how an application will process real-life UTF-8 encoded text, which is definitely a sequence of 8-bit bytes, when there is no 8-bit integral type. There is no need to process a single UTF-8 unit, except when you are implementing the very code which will assemble them into a Unicode codepoint, which should be implemented only once (preferably in a standard library). That method needs to be able to define the input as an array of 8-bit units, which UTF-8 encoded text always is. When reading the element, any integral type of at least 8 bits suffices. – Holger Aug 08 '19 at 08:01
  • 3
    @Holger: `CHAR_BIT` is *at least* 8. Assume we are dealing with an implementation where `CHAR_BIT` is (e.g.) 9, and our UTF-8 encoded text is arriving over the network. The call to `read` (or whatever the networking primitive is called) will receive octets from the network, and store them in 9-bit bytes (using the word in its C++ standard meaning) in memory. Similarly a file containing UTF-8 will store each UTF-8 sub-unit in a 9-bit byte (with a leading zero bit). The file will not pack nine UTF-8 units into eight 9-bit bytes. (Or at least, it won't unless somebody is being silly). – Martin Bonner supports Monica Aug 08 '19 at 08:34
  • 1
    @MartinBonner and does that automatic behavior also work in the other direction, i.e. is the ninth bit always discarded when writing a sequence of these 9-bit bytes to a file or sending data over the network? Well *that's* what I'd call silly. But let's not judge that hypothetical architecture, let's talk about the C++ standard that takes so much care to support such hypothetical systems transparently. What do you think, how many real life applications handle these aspects of the standard correctly, in that these applications will work smoothly on such systems? – Holger Aug 08 '19 at 08:54
  • 2
    @Holger As Martin says, incoming UTF-8 data would probably have to be stored in bytes rather than in octets. As for writing output, my guess is that writing data to a text stream would strip it to 8 bits, but writing to a binary stream would retain all `CHAR_BIT` bits (because you have to be able to read back the same binary data you wrote). But it's unlikely to matter, because as far as I know all hosted implementations have `CHAR_BIT==8`. (Some DSPs set `CHAR_BIT` to 16 or 32, but they're not hosted so they don't have to support standard I/O.) – Keith Thompson Aug 08 '19 at 09:19
  • 2
    @KeithThompson but would that imply that applications have to read UTF-8 input with special functions or are unsigned bytes and “UTF-8 units”, i.e. `char8_t`, just interchangeable, even on those exotic systems? I also have the feeling that it hardly ever matters, however, there must be a reason why the C++ standard committee puts such a burden on the programmer… – Holger Aug 08 '19 at 09:27
  • @Holger: `char8_t` by definition has the same size, range, and representation as `unsigned char`, whatever size that is. Probably there would be some way to translate 8-bit UTF-8 text to a form that could be stored on such a system. It's unlikely to come up in practice. I'm not sure what burden you're referring to. – Keith Thompson Aug 08 '19 at 23:41
  • 1
    @Holger That's not a hypothetical architecture - 9 bit bytes have existed in the past in real hardware, and C ran on it, and it did exactly that - the top bit is simply ignored when reading or writing octet-based network or storage data. *To this day* systems like that exist, although the new ones are merely emulating 9 bit byte hardware on top of 8 bit byte hardware. (I call the 9 bit byte the "banker's byte", because of course the only people willing to keep the 9 bit byte going are the financial industry, where they'll do anything to not have to rewrite software.) – mtraceur Apr 23 '20 at 06:44
  • 1
    @mtraceur that's contradictory. When the C implementation simply ignores the 9th bit, the behavior is just as if the standard said "a byte has 8 bits", but the standard is not saying it. It allows C implementations which do *not* ignore the bit and require the application programmer to deal with it. – Holger Apr 23 '20 at 06:54
  • @Holger I said it is ignored specifically when doing I/O over octet byte mediums, with "ignored" obviously meaning set to zero on read, unused on write, as you were already discussing in prior comments. The bit is fully accessible and usable the rest of the time. The whole C implementation does not ignore it, just some I/O routines in the provided libraries do. – mtraceur Apr 23 '20 at 09:18
  • Being a distinct type also affects overload resolution. Previously `template <typename T> void Foo(T&& x, const char* name)` would be selected for a call like `Foo(5, u8"FOO")`. But when C++20 is enabled, that function template will be ignored. So it is a breaking change in the standard. https://github.com/tahonermann/char8_t-remediation – zahir Apr 08 '22 at 14:58