
I was looking for char16_t and char32_t, since I’m working with Unicode, and all I could find on the Web was that they live in uchar.h. I found said header inside the iOS SDK (not the macOS one, for some reason), but there were no such types in it. I did see them mentioned in a different header, but I could not find where they’re actually defined. Also, the info on the internet is scarce at best, so I’m kinda lost here. I did read, though, that wchar_t should not be used for Unicode, which is exactly what I’ve been doing so far, so please help :(

chqrlie
  • It's just `typedef`s for integer types such as `unsigned short` or `unsigned int`. Nothing more to say. – DeiDei Sep 09 '18 at 01:24
  • @DeiDei So, I don’t really need them? Interesting. I’d still like to know where they reside, though... – Rodrigo Pelissier Sep 09 '18 at 01:26
  • You can always typedef them yourself if they don't exist... – Mad Physicist Sep 09 '18 at 01:29
  • They reside in `uchar.h`. That's what the Standard says. If you can't find them there on a certain implementation, it's purely a detail of that implementation. It may be included somewhere deeper in the file. – DeiDei Sep 09 '18 at 01:29
  • Is it true I shouldn’t use `wchar_t`, though? – Rodrigo Pelissier Sep 09 '18 at 01:39
  • @DeiDei - that's not strictly correct. An `unsigned short` is at least 16 bits (and on any modern sane platform this is true); but `unsigned int` is *not* required to be 32 bits. `unsigned long`, on the other hand, is required to be at least 32 bits wide. All that said, a platform that can't provide an exact 8, 16, 32, 64 bit unsigned / signed type (either by platform or compiler) should be considered a failed ISA / ABI. – Brett Hale Sep 10 '18 at 07:46
  • char16_t is typedef'd as `uint_least16_t` and char32_t is typedef'd as `uint_least32_t` according to the standard. – MarcusJ Jun 01 '19 at 01:25

2 Answers


char16_t and char32_t are specified in the C standard. (Citations below are from the 2018 standard.)

Per clause 7.28, the header <uchar.h> declares them as unsigned integer types to be used for 16-bit and 32-bit characters, respectively. You should not have to hunt for them in any other header; #include <uchar.h> should suffice.
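
For instance, a minimal, self-contained program (assuming the implementation actually ships <uchar.h>) needs nothing else:

```c
#include <stdio.h>
#include <uchar.h>   /* declares char16_t and char32_t (C11 and later) */

int main(void)
{
    char16_t c16 = 0x00E9;   /* U+00E9, LATIN SMALL LETTER E WITH ACUTE */
    char32_t c32 = 0x1F600;  /* a code point outside the Basic Multilingual Plane */

    /* Both are ordinary unsigned integer types, so they print as integers. */
    printf("c16 = 0x%X, c32 = 0x%lX\n", (unsigned)c16, (unsigned long)c32);
    return 0;
}
```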

Also per clause 7.28, each of these types is the narrowest unsigned integer type with at least the required number of bits. (For example, on an implementation that supported only unsigned integers of 8, 18, 24, 36, and 50 bits, char16_t would have to be the 18-bit type; it could not be 24, and char32_t would have to be 36.)
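
Since, in C, char16_t and char32_t are required to be the same types as uint_least16_t and uint_least32_t (see the comments below), this can be checked at compile time; a small sketch, assuming a C11 compiler for _Static_assert:

```c
#include <stdint.h>
#include <uchar.h>

/* C11/C18 clause 7.28 makes char16_t the same type as uint_least16_t and
   char32_t the same type as uint_least32_t, so these assertions must hold. */
_Static_assert(sizeof(char16_t) == sizeof(uint_least16_t),
               "char16_t is uint_least16_t");
_Static_assert(sizeof(char32_t) == sizeof(uint_least32_t),
               "char32_t is uint_least32_t");
```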

Per clause 6.4.5, when a string literal is prefixed by u or U, as in u"abc" or U"abc", it is a wide string literal in which the elements have type char16_t or char32_t, respectively.
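
A quick illustration of both literal forms (a sketch only; nothing here is platform-specific):

```c
#include <stdio.h>
#include <uchar.h>

int main(void)
{
    /* u"..." yields an array of char16_t, U"..." an array of char32_t. */
    const char16_t *s16 = u"abc";
    const char32_t *s32 = U"abc";

    /* The matching character-constant prefixes exist as well. */
    char16_t c16 = u'a';
    char32_t c32 = U'a';

    printf("%u %u %u %u\n",
           (unsigned)s16[0], (unsigned)s32[0], (unsigned)c16, (unsigned)c32);
    return 0;
}
```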

Per clause 6.10.8.2, if the C implementation defines the preprocessor macro __STDC_UTF_16__ to be 1, it indicates that char16_t values are UTF-16 encoded. Similarly, __STDC_UTF_32__ indicates char32_t values are UTF-32 encoded. In the absence of these macros, no assertion is made about the encodings.
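
So a program that wants to rely on UTF-16 or UTF-32 semantics can guard that assumption behind the macros; a minimal sketch:

```c
#include <stdio.h>
#include <uchar.h>

int main(void)
{
#if defined(__STDC_UTF_16__)
    puts("char16_t values are UTF-16 encoded");
#else
    puts("no guarantee about the encoding of char16_t");
#endif
#if defined(__STDC_UTF_32__)
    puts("char32_t values are UTF-32 encoded");
#else
    puts("no guarantee about the encoding of char32_t");
#endif
    return 0;
}
```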

Eric Postpischil
  • I definitely can’t include that header on macOS. – Rodrigo Pelissier Sep 09 '18 at 02:23
  • @RodrigoPelissier: Indeed it appears not to be present for macOS, although it is for iOS. I suggest filing a bug report with Apple about that. I expect using `typedef` to define the types as `uint_least16_t` and `uint_least32_t` could be a workaround. The C standard requires them to be the same as those types. – Eric Postpischil Sep 09 '18 at 02:28
  • How could I handle EOF with those? – Rodrigo Pelissier Sep 09 '18 at 02:39
  • @RodrigoPelissier "How could I handle EOF with those" sounds like another question. Perhaps accept one answer here and post a new query. – chux - Reinstate Monica Dec 15 '18 at 07:05
  • The problem is that Apple defines `__CHAR16_TYPE__` but does not include `uchar.h` in the MacOS SDK, so not even the preprocessor can save you without disabling it for all Apple platforms anyway. – MarcusJ Jun 01 '19 at 01:27
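
Building on the typedef workaround discussed in the comments above, a fallback header for affected SDKs might look roughly like this. It is only a sketch: the guard and helper macro names (COMPAT_UCHAR_H, COMPAT_HAVE_UCHAR_H) are invented for illustration, and the __has_include probe assumes a compiler that supports it (Clang and recent GCC do).

```c
/* compat_uchar.h -- hypothetical shim for C code on SDKs lacking <uchar.h>. */
#ifndef COMPAT_UCHAR_H
#define COMPAT_UCHAR_H

#if defined(__has_include)
#  if __has_include(<uchar.h>)
#    define COMPAT_HAVE_UCHAR_H 1
#  endif
#endif

#ifdef COMPAT_HAVE_UCHAR_H
#  include <uchar.h>
#else
#  include <stdint.h>
   /* The C standard requires char16_t/char32_t to be exactly these types,
      so plain typedefs are a workable stand-in (C only; in C++ they are
      built-in keywords). */
   typedef uint_least16_t char16_t;
   typedef uint_least32_t char32_t;
#endif

#endif /* COMPAT_UCHAR_H */
```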

Microsoft has a fair description: https://learn.microsoft.com/en-us/cpp/cpp/char-wchar-t-char16-t-char32-t?view=vs-2017

  • char is the original, typically 8-bit, character representation.

  • wchar_t is a "wide char"; its width is implementation-defined: 16 bits on Windows, 32 bits on most other platforms. Microsoft was an early adopter of Unicode; unfortunately, this stuck them with a 16-bit encoding that is essentially only used on Windows.

  • char16_t and char32_t are used for UTF-16 and UTF-32, respectively.

Most non-Windows systems use UTF-8 for encoding (and even Windows 10 is adopting this, https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8). UTF-8 is by far the most common encoding used today on the web. (ref: https://en.wikipedia.org/wiki/UTF-8)

UTF-8 is stored in a series of chars, and it is likely the encoding you will find simplest to adopt, depending on your OS.
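
Iterating such a string by code point mostly means skipping UTF-8 continuation bytes (those of the form 10xxxxxx); here is a minimal sketch that assumes the input is valid UTF-8 and counts code points, not grapheme clusters:

```c
#include <stdio.h>
#include <string.h>

/* Count code points in a valid UTF-8 string by counting every byte that is
   not a continuation byte (10xxxxxx). */
static size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

int main(void)
{
    const char *word = "\xC3\xA9t\xC3\xA9";   /* "été" spelled out as UTF-8 bytes */
    printf("bytes: %zu, code points: %zu\n",
           strlen(word), utf8_codepoint_count(word));  /* bytes: 5, code points: 3 */
    return 0;
}
```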

Graham Perks
  • I’d like it to work on as many platforms as possible, but I’m currently focusing on macOS. I set the locale to UTF-8, but I was still having trouble when using just `char` (I couldn’t iterate over them, as a single character spanned multiple `char`s). So I was using `wchar_t`, which, at least on my Mac, works perfectly; but I got a little concerned when I read earlier that `wchar_t` is not supposed to be used for such purposes. – Rodrigo Pelissier Sep 09 '18 at 02:02
  • You are right in that you can no longer iterate a char array as you can with ASCII. A character can be 1, 2, 3 or more bytes. While UTF-8 is backwards compatible with e.g. strcpy, that's not so for iterating or strlen - those operate on bytes, not characters. An interesting read: https://utf8everywhere.org. This question has answers offering cross-platform solutions: https://stackoverflow.com/questions/4579215/cross-platform-iteration-of-unicode-string-counting-graphemes-using-icu – Graham Perks Sep 09 '18 at 02:13