10

Update 2022 Jul 28

P2513R4, char8_t Compatibility and Portability Fix, Draft Proposal, 2022-06-17

Two years later, the char8_t definition (or lack thereof) is now called a "C++20 defect" and there is a rush to fix it. Finally.

Update 2020 Aug 25

The question seems somewhat irrelevant in the light of this:

// GCC 10.2, clang 10.0.1  -std=c++20

int main(int argc, char ** argv) 
{
    char32_t single_glyph_32 = U'ア' ;
    char16_t single_glyph_16 = u'ア' ;
    // gcc:   error: character constant too long for its type
    // clang: error: character too large for enclosing character literal type
    char8_t single_glyph_8 = u8'ア' ;

    return 42;
}

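A single char8_t holds one UTF-8 code unit, so a u8 character literal can represent only code points that encode to a single code unit, which covers just a tiny portion of UTF-8 glyphs. Thus there is not much point in using it or trying to printf it.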
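To make the code unit point concrete, here is a minimal sketch (not part of the original post) that lists the code units of a u8 string. The helper name `utf8_code_units` is hypothetical. It is templated on the code unit type, on the assumption that it should accept `char8_t` strings under C++20 as well as the plain `char` u8 strings of earlier standards.

```cpp
#include <vector>

// Hypothetical helper: collect the code units of a NUL-terminated u8 string.
// CodeUnit is char8_t under C++20 and char under earlier standards.
template <typename CodeUnit>
std::vector<unsigned> utf8_code_units(const CodeUnit* text) {
    static_assert(sizeof(CodeUnit) == 1, "expects one-byte UTF-8 code units");
    std::vector<unsigned> units;
    for (; *text != CodeUnit{}; ++text) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        units.push_back(static_cast<unsigned char>(*text));
    }
    return units;
}
```

For `u8"ア"` (U+30A2) this yields three code units, 0xE3 0x82 0xA2, which is why the character cannot fit in one `char8_t`, whereas `u8"A"` yields a single code unit.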

Asked Nov 15 '19 at 14:04

And also for char8_t?

I assume there is some C++20 decision, somewhere, but I could not find it. There is also P1423, but that document does not mention anything about the printf() family vs. char8_t * or char8_t.

The "use std::cout" advice might be an answer. Unfortunately, that does not compile anymore.

// does not compile under C++20
// error : overload resolution selected deleted operator '<<'
// see P1423, proposal 7
std::cout <<  u8"A2";
std::cout <<  char8_t ('A');

For C 2.x and char8_t

Please start from here.

Update

I have done some more tests with a single element from a u8 sequence. And that indeed does not work. char8_t * to printf("%s") does work, but char8_t to printf("%c") is an accident waiting to happen.

Please see -- https://wandbox.org/permlink/6NQtkKeZ9JUFw4Sd -- Problem is, as per the current status quo, char8_t is not implemented, char8_t * is. -- let me repeat: there is no implemented type to hold a single element from a char8_t * sequence.

If you want a single u8 glyph, you need to code it as a u8 string:

char8_t const * single_glyph = u8"ア";

At present, it seems the only reasonably sure way to print the above is:

// works with warnings
std::printf("%s", single_glyph ) ;
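The warning comes from the mismatch between `%s` and `char8_t const *`. One way to make the intent explicit is to cast the pointer before passing it to `printf`. The sketch below is an illustration only; `as_narrow` and `print_utf8` are hypothetical names, and the helpers are templated so they also compile where u8 literals are still plain `char`. The bytes are passed through unchanged, so whether they display correctly still depends on the encoding the terminal expects.

```cpp
#include <cstdio>

// Hypothetical helper: view UTF-8 code units as plain chars for C APIs.
// Reading objects through a char pointer is allowed by the aliasing rules.
template <typename CodeUnit>
const char* as_narrow(const CodeUnit* text) {
    static_assert(sizeof(CodeUnit) == 1, "expects one-byte UTF-8 code units");
    return reinterpret_cast<const char*>(text);
}

// Hypothetical helper: print the bytes as-is; no transcoding is performed.
// Returns what printf returns (bytes written, or a negative value on error).
template <typename CodeUnit>
int print_utf8(const CodeUnit* text) {
    return std::printf("%s", as_narrow(text));
}
```

With these in place, `print_utf8(single_glyph)` compiles without the format warning.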

To start reading on this subject, these two papers are probably required reading:

  1. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm
  2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r2.html

In that order.


My primary DEVENV is Visual Studio 2019, with both MSVC and Clang 8.0.1, as delivered with VS, compiling with /std:c++latest. The dev machine is Windows 10 [Version 10.0.18362.476].

Chef Gladiator

4 Answers

9

I'm the author of the char8_t P0482 and P1423 proposals for C++ (accepted for C++20) and the N2653 proposal for C (accepted for C23).

Let's think about what the following should do:

printf("Hello %s\n", u8"Jöel");
std::cout << "Hello " << u8"Jöel" << "\n";

Actually, let's take a further step back. What encoding is expected on the receiver side of standard output? There are a few possibilities. If standard out is connected to a console/terminal, then the expected encoding is the one that the console/terminal is configured for. On a Windows system in the United States, this is likely to be CP437. On a UNIX/Linux system, this is likely UTF-8. On a z/OS system in the United States, this is likely EBCDIC code page 037. If standard out has been redirected, then the expected encoding is likely locale dependent. On a Windows system in the United States, that would mean the Active Code Page (ACP), likely Windows 1252. On UNIX/Linux and z/OS, it would likely be the same as the console/terminal (Windows is the odd system here that has different defaults for console encoding vs locale encoding).

Back to that example code. What is the expected or desired behavior for that UTF-8 encoded ö character (U+00F6, {LATIN SMALL LETTER O WITH DIAERESIS}, encoded as 0xC3 0xB6)? For Windows writing to the console, for the character to display properly, the encoded sequence would need to be transcoded to 0x94 while for Windows where locale dependent output is expected, it would need to be transcoded to 0xF6. For UNIX/Linux, the sequence should probably be passed through. For z/OS, it may need to be transcoded to 0xCC. But on all of these systems, these defaults are configurable (e.g., via the LANG environment variable).

Assuming that transcoding to a run-time determined encoding is the desired behavior, how should transcoding errors be handled? For example, what should happen if the target encoding lacks representation for ö? What if an ill-formed UTF-8 sequence is present? Should printf stop and report an error? Should std::cout throw an exception? Or should an implementation defined character such as U+FFFD {REPLACEMENT CHARACTER} or ? be substituted?
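As an illustration of one possible answer to these questions, here is a minimal sketch that transcodes UTF-8 to a fixed target encoding and substitutes `?` on errors. Everything in it is an assumption made for the example: the function name `utf8_to_latin1` is hypothetical, the target is ISO-8859-1 (chosen because U+0000 through U+00FF map to the same byte values, so ö becomes 0xF6, the same byte given above for Windows-1252), and the validation is deliberately simplified. A real program would more likely rely on a conversion library and a run-time determined target.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: transcode UTF-8 bytes to ISO-8859-1 (Latin-1).
// Policy (an assumption for this sketch): substitute `replacement` for
// anything ill-formed or unrepresentable instead of reporting an error.
std::string utf8_to_latin1(const std::string& utf8, char replacement = '?') {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {  // ASCII passes through unchanged
            out.push_back(static_cast<char>(lead));
            ++i;
            continue;
        }
        // Expected sequence length from the lead byte (0 = not a valid lead).
        std::size_t length = 0;
        if (lead >= 0xC2 && lead <= 0xDF) length = 2;
        else if (lead >= 0xE0 && lead <= 0xEF) length = 3;
        else if (lead >= 0xF0 && lead <= 0xF4) length = 4;

        // Simplified check: every trailing byte must look like 10xxxxxx.
        bool well_formed = length != 0 && i + length <= utf8.size();
        for (std::size_t k = 1; well_formed && k < length; ++k) {
            const unsigned trail = static_cast<unsigned char>(utf8[i + k]);
            well_formed = (trail & 0xC0) == 0x80;
        }
        if (!well_formed) {  // one replacement per offending byte, then resync
            out.push_back(replacement);
            ++i;
            continue;
        }
        if (length == 2) {
            const unsigned trail = static_cast<unsigned char>(utf8[i + 1]);
            const unsigned code_point = ((lead & 0x1F) << 6) | (trail & 0x3F);
            out.push_back(code_point <= 0xFF ? static_cast<char>(code_point)
                                             : replacement);
        } else {
            out.push_back(replacement);  // beyond U+00FF: not in Latin-1
        }
        i += length;
    }
    return out;
}
```

Silent substitution is only one of the policies listed above; stopping with an error or throwing an exception would be equally defensible, which is part of why the question is hard to settle in a standard.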

What should happen if std::cout is imbued with a std::codecvt facet? Presumably that facet will expect incoming text to be in a particular encoding. Should UTF-8 text be transcoded to one of the execution character set, the locale dependent encoding, or the console/terminal encoding before being presented to the facet? If so, which one? Should the implementation have to be aware of whether the stream is connected to a console/terminal? What if the programmer wants to override the default and, for example, always write UTF-8?

These are rather difficult questions that we don't have good answers for. std::u8out has been suggested, as a way to explicitly opt-in to UTF-8, but doesn't solve the problems of expected standard output encoding, issues with codecvt facets, and other iostreams problems like implicit locale dependent formatting.

Personally, in order to provide good Unicode support going forward, I think we're going to have to invest in a replacement for iostreams that 1) provides byte output with text support layered on top, 2) is encoding aware (in the text layer), 3) is locale independent (but with explicit opt-in support for locale dependent formatting like that provided by std::format), 4) is more performant than iostreams.

SG16 would like to hear your thoughts and suggestions. See https://github.com/sg16-unicode/sg16 for contact information.

EDIT: As of 2022-05-22, there is a paper, N2983, making its way through WG14 that seeks to add length modifiers to the formatted I/O functions for char8_t, char16_t, and char32_t characters and strings.

Tom Honermann
  • 3
    I forgot to mention. Our short term plan (C++23) for working around the noted limitations is to provide explicit encode, decode, and transcode interfaces as described in [P1629](https://wg21.link/p1629). This will allow programmers to manually transcode as necessary between the various execution and UTF encodings. – Tom Honermann Nov 17 '19 at 15:16
  • 3
    Dear Tom, I know about P1629. It is good and logical. But. The "only thing" I need is to have `printf()` fully implemented and capable of outputting u8 sequences and single elements. That is `char8_t *` and `char8_t`. -- `u8` is "in" since 2011. And `char8_t` is a C++20 keyword. Still, it seems the required decisions and implementations are missing. I would think UTF-8 is by now rather mission critical. I assume the C++ community at large cannot wait till 2023 to have UTF-8 fully decided and implemented in standard C++. – Chef Gladiator Nov 17 '19 at 16:07
  • What confuses me is essentially "only" two issues -- 1 -- `<cuchar>` aka `<uchar.h>` .. is that not "it" with `printf()` decisions and implementations added ? -- 2 -- How are projects not using std lib served by the snazzy and complex C++ summarized in P1629 and other papers in the SG16 domain? – Chef Gladiator Nov 17 '19 at 16:25
  • 4
    I agree that UTF-8 is rather mission critical these days and, as much as I would have liked to provide more support in C++20, that is no longer an option. I first presented `char8_t` to the committee in November 2016 and it wasn't until November 2018 that it was accepted. It then took several more meetings this year to get P1423 through the committee. Change doesn't always happen as quickly as we would like. – Tom Honermann Nov 18 '19 at 16:59
  • 3
    In the answer I provided, I asked what the behavior of `printf("%", u8"text")` should be. It isn't clear to me. I suspect you may have an opinion on what should happen and I further suspect there are design decisions we could make that you would find objectionable or questionable. What behavior would you prefer and why? – Tom Honermann Nov 18 '19 at 17:04
  • 2
    C++ defers to C for the specification of `printf` and friends. The C++ standard *could* place additional requirements on these functions, but most C++ implementations defer to an implementation of the C standard library that they do not control. Making change to `printf` will effectively require us to go through WG14. And WG14 won't have a new standard for at least three years. So, it will be some time before we see changes to `printf`; assuming we can agree on what those changes should be. – Tom Honermann Nov 18 '19 at 17:08
  • 1
    Thanks Tom, `::printf( "%@", u8"ひらがな" ) ;` should print exactly: `ひらがな` .. of course using the correct font. -- replace '@' in the future -- I would think nobody would object to that? – Chef Gladiator Nov 19 '19 at 10:12
  • 1
    Stating that the UTF-8 provided characters should be printed isn't sufficient to describe the behavior. Consider the example I provided again: `printf("Hello %@\n", u8"Jöel")` when run on Windows. The `"Hello"` portion is encoded according to the execution character set (perhaps Windows-1252) and the output is going to either the console (perhaps CP437) or may be redirected to another process (perhaps expecting ACP encoding, e.g., Windows-1252). Should the UTF-8 content be transcoded? Should `printf` have to be aware of the redirection? How should transcoding errors be handled? – Tom Honermann Nov 19 '19 at 16:01
  • I might be so bold as to think these are issues outside the jurisdiction of the language and compilers, and especially not a job for a standard to solve. Consider "%s" .. ISO C99 defines it as being used to output zero-terminated ASCII strings. I have never heard of programming language standards describing how an implementation is going to implement UTF-8 output. If I understand you right? – Chef Gladiator Nov 19 '19 at 16:16
  • 2
    The environment that a program runs in is outside the control of the standard, but how the program interacts with the environment is within its control. ISO C99 does not define the `%s` format specifier as being used to output *ASCII* strings, but rather to output "characters" (which may not be ASCII) up to but not including a `NUL` character. Java's `OutputStreamWriter` allows specifying an encoding (and defaults to locale settings). Python 3 changed its interaction with [PEP 538](https://www.python.org/dev/peps/pep-0538) and [PEP 540](https://www.python.org/dev/peps/pep-0540). – Tom Honermann Nov 19 '19 at 19:24
  • 1
    This is insane. What is the point of introducing a new type char8_t, when it is problematic to apply "std::cout" to a u8 string. – John Z. Li Dec 30 '20 at 02:23
  • 1
    @JohnZ.Li, Please see [P0482](https://wg21.link/p0482), [P1423](https://wg21.link/p1423), and [this answer](https://stackoverflow.com/questions/57402464/is-c20-char8-t-the-same-as-our-old-char/57453713#57453713) for motivation. We're well aware that support for `char8_t` is woefully incomplete and we are working to improve the situation. One important paper currently being discussed is [P2093](https://wg21.link/p2093r2). See reviews [here](https://github.com/sg16-unicode/sg16-meetings#november-11th-2020) and [here](https://github.com/sg16-unicode/sg16-meetings#december-9th-2020). – Tom Honermann Dec 31 '20 at 01:46
7

What is the printf() formatting character for char8_t *?

There is no format specifier that will print char8_t* as a string. Using %s is technically undefined behavior because of a type mismatch, and clang will warn you about it (https://godbolt.org/z/xcs9Wj):

printf("%s", u8"Привет, мир!");
...: warning: format specifies type 'char *' but the argument has type 'const char8_t *' [-Wformat]
  printf("%s", u8"Привет, мир!");
          ~~   ^~~~~~~~~~~~~~~~
          %s

So the only thing you can do is to print such string as a pointer with %p which is not very useful.

iostreams don't work with char8_t strings either. For example this doesn't compile in C++20:

std::cout << u8"Привет, мир!";

On most platforms normal char strings are already UTF-8 and on Windows with MSVC you can compile with /utf-8 which will give you Unicode support on major operating systems.

For portable Unicode output you can use the {fmt} library, for example (https://godbolt.org/z/3ejsaG):

#include <fmt/core.h>

int main() {
  fmt::print("Привет, мир!");
}

prints:

Привет, мир!

Disclaimer: I'm the author of {fmt}.

vitaut
  • true indeed. Also (I assume the same as you) I am watching the addition of [`<cuchar>`](https://en.cppreference.com/w/cpp/header/cuchar) to the C++20 conformant compilers. AFAIK that would be the only standard way to transform to/from `char8_t *`. – Chef Gladiator Dec 30 '20 at 21:30
  • Unfortunately mbrtoc8 and c8rtomb are of limited use because they rely on the global locale encoding. – vitaut Dec 30 '20 at 22:14
  • 1
    @ChefGladiator, you may be interested in following the progress of [WG14 N2620](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2620.htm). This paper proposes new interfaces for C to support conversions between all of the narrow, wide, and UTF encodings. – Tom Honermann Dec 31 '20 at 01:53
  • 2
    Many thanks Tom. Although we need a solution yesterday, not in '23. – Chef Gladiator Jan 01 '21 at 13:51
  • @ChefGladiator, it took me far longer than it should have, but I have finally submitted [N2653 (char8_t: A type for UTF-8 characters and strings (Revision 1))](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm) to WG14 and submitted an implementation to gcc, libstdc++, and glibc. Patches submitted to gcc [here](https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572022.html), libstdc++ [here](https://gcc.gnu.org/pipermail/libstdc++/2021-June/052685.html), and to glibc [here](https://sourceware.org/pipermail/libc-alpha/2021-June/127230.html). – Tom Honermann Jun 07 '21 at 03:12
  • @TomHonermann very good. I Will check at my earliest convenience. – Chef Gladiator Jun 09 '21 at 00:57
  • Implementations of the library portions of [N2653 (char8_t: A type for UTF-8 characters and strings (Revision 1))](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm) (the `char8_t` typedef and the `mbrtoc8()` and `c8rtomb()` functions) have finally landed in glibc for the 2.36 release ([commit](https://sourceware.org/git/?p=glibc.git;a=commit;h=8bcca1db3d7c0dc900a4cad4054c1439baf73684)) that is expected to be released in August. – Tom Honermann Jul 06 '22 at 18:12
2

printf is not defined by C++20 itself; C++20 includes the C standard library by reference. It will likely reference C18, but that's substantially equal to C11 (no new features; just fixes defect reports).

MSalters
  • 1
    `cout << u8"A2";` `cout << char8_t ('A');` does not compile in C++20, as per P1423, proposal 7. – Chef Gladiator Nov 15 '19 at 14:28
  • 1
    @ChefGladiator I hope you didn't downvote. `printf` was **always** a C function. In C++ one should use streams like cout, wcout etc. `Effective C++` included that advice almost 20 years ago. If `std::cout` doesn't work, you'll have to find the correct stream. Using `printf` remains a C function – Panagiotis Kanavos Nov 15 '19 at 14:31
  • @ChefGladiator: Fair point, and it wasn't even part of the question so I removed it. – MSalters Nov 15 '19 at 14:35
  • Sorry MSalters :) @PanagiotisKanavos please do some more research. Or perhaps not. It can make you seriously unhappy. – Chef Gladiator Nov 15 '19 at 14:39
  • 2
    @PanagiotisKanavos "In C++ one should use streams like cout, wcout". Until they are deprecated, which will probably happen before long. They are essentially broken. What seemed like a good idea 20 years ago suddenly looks clunky and rattling now. There is already a better formatting mechanism in C++20 (std::format). – n. m. could be an AI Nov 15 '19 at 20:25
-1

The "use std::cout" advice might be an answer. Unfortunately, that does not compile anymore.

For me it compiles well (I tested on experimental GCC 10.0.0 on Wandbox) but does not print what you might expect/want.


I have read this SO answer, which states that char8_t is implemented the same way as unsigned char even though they are not the same type (char8_t is not a typedef of unsigned char).

Knowing this, you could write something like this overload:

#include <iostream>

std::ostream & operator<<(std::ostream & os, const char8_t & c8)
{
    return os << static_cast<unsigned char>(c8);
}

Then you should be able to write something like:

char8_t a = 'u';
std::cout << a << std::endl;

And it will output:

u

instead of

117

I did the test here.

I think you should be able to do something equivalent for char8_t * (edit: example here).
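Since the linked example is not reproduced here, the following is a rough sketch of what an equivalent for strings could look like. It uses a named function instead of an `operator<<` overload, which is a choice made for this illustration and not something the answer prescribes; the name `write_utf8` is hypothetical, and the template parameter is an assumption so the code also builds where u8 literals are plain `char`.

```cpp
#include <cstring>
#include <ostream>
#include <sstream>  // only for the std::ostringstream usage mentioned below

// Hypothetical helper: write the code units of a NUL-terminated u8 string
// to a narrow stream as raw bytes. No transcoding is performed.
template <typename CodeUnit>
std::ostream& write_utf8(std::ostream& os, const CodeUnit* text) {
    static_assert(sizeof(CodeUnit) == 1, "expects one-byte UTF-8 code units");
    const char* bytes = reinterpret_cast<const char*>(text);
    return os.write(bytes, static_cast<std::streamsize>(std::strlen(bytes)));
}
```

It can be used as `write_utf8(std::cout, u8"ア") << '\n';`, or with a `std::ostringstream` to capture the bytes. As with the single-character overload, the result only looks right if the receiving end expects UTF-8.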


Please let me know if I did not catch your point.

Fareanor
  • 1
    If you google for `char8_t`, streams, cout etc you'll find that the C++ 20 committee hasn't decided what to do for output yet and probably *doesn't* want to take a decision for this version. There are similar (possibly duplicate) questions in SO too. Different OSs handle output differently too. Windows for example uses UTF16, even if the *old console* converted the text to the user's locale. Linux doesn't use UTF16, so what should *C++* do? – Panagiotis Kanavos Nov 15 '19 at 15:26
  • In fact, the *correct* answer in the question you link to is the *other* one. That was provided by the *author* of the proposals. – Panagiotis Kanavos Nov 15 '19 at 15:27
  • @PanagiotisKanavos Both are correct. The author of the proposal has just added missing information about the aliasing rule. But in my snippet, I did not violate it since I do not use `char8_t*` to alias something else. Since the implementation is similar to that of `unsigned char`, I don't see anything that prevents me from casting a `char8_t` into an `unsigned char`, and the same goes for casting `char8_t*` into `unsigned char *`. Size, alignment, ... are the same, and this way I do not violate the strict aliasing rule. – Fareanor Nov 15 '19 at 15:34
  • 1
    Other discussions show that this is no longer allowed, eg [this one](https://stackoverflow.com/questions/56613226/outputting-char8-t-const-to-cout-and-wcout-one-compiles-one-not) - the operators were deleted so what do we do now? Tom Honermann answers again and the answer is `we don't yet have consensus for what the behavior of the deleted overloads should be`. Perhaps we should ping him – Panagiotis Kanavos Nov 15 '19 at 15:40
  • @PanagiotisKanavos If the compiler forbids user-made overload of `operator<<(std::ostream &, const char8_t *)`, nothing prevents you from writing `std::cout << reinterpret_cast<const char *>(u8"Hello world");` directly. – Fareanor Nov 15 '19 at 15:44
  • That's ugly, but... I've answered many R-related questions about mangled text and understand that ugly is probably better than allowing people to just output text without realizing they need to take encoding into account. The product itself understands encodings and codepages (now) but many libraries don't, resulting in mangled text. The *code* itself uses `char*` for Linux and `wchar*` for Windows through #ifdefs. I wonder how they'll handle char8_t – Panagiotis Kanavos Nov 15 '19 at 15:55
  • That is the current state of affairs indeed: `we don't yet have consensus for what the behavior of the deleted overloads should be...` ... But and again that does not mention the `printf` family. **Fortunately**. In essence, whatever hack one does that might not work anymore when committee makes up its collective mind. – Chef Gladiator Nov 15 '19 at 16:45
  • @Fareanor you did catch the point. Your kind suggestions, do show that things are "up in the air" for `char8_t`. In the meantime, my safe passage is to use `printf`. It seems compiler vendors have done that too. In a covert way :) – Chef Gladiator Nov 15 '19 at 17:01
  • 1
    @Fareanor please see -- https://wandbox.org/permlink/6NQtkKeZ9JUFw4Sd -- your sample does not work for wide utf8 glyphs. Problem is, as per the current status quo, `char8_t` is not implemented, `char8_t *` is. -- let me repeat: there is no implemented type to hold a single element from a `char8_t *` sequence. – Chef Gladiator Nov 15 '19 at 22:27
  • @ChefGladiator Ah right, I did not know that (you taught me something today :) ). But anyway, if you use [this overload instead](https://wandbox.org/permlink/ydvg6iCcKKcm0sgB), it should work for `char8_t*` (this is the second example I have given in my answer). But if `printf()` does the job with your compiler and you prefer to use it, then that's fine too :) – Fareanor Nov 16 '19 at 17:22
  • @Fareanor no worries. As long as C++20 is not officially out we will not be certain of the final solution. Thanks for your involvement. – Chef Gladiator Nov 16 '19 at 22:57