49

C++11 brought us the u8 prefix for UTF-8 literals. I thought that was pretty cool a few years ago and peppered my code with things like this:

std::string myString = u8"●";

This was all fine and good, but the issue comes up in C++20: it no longer seems to compile, because u8 now creates a const char8_t* and that is incompatible with std::string, which just uses char.

Should I be creating a new utf8string? What's the consistent and correct way to do this kind of thing in a C++20 world where we have more explicit types that don't really match with the standard std::string?

Peter Mortensen
M2tM

6 Answers

29

In addition to @lubgr's answer, the paper char8_t backward compatibility remediation (P1423) discusses several ways to construct std::string from char8_t character arrays.

Basically, the idea is that you can cast the u8 char array to a "normal" char array to get the same behaviour as in C++17 and before; you just have to be a bit more explicit. The paper discusses various ways to do this.

The simplest method that fits your use case (though not fully zero-overhead, unless you add more overloads) is probably the last one, i.e. introducing explicit conversion functions:

std::string from_u8string(const std::string &s) {
  return s;
}
std::string from_u8string(std::string &&s) {
  return std::move(s);
}
#if defined(__cpp_lib_char8_t)
std::string from_u8string(const std::u8string &s) {
  return std::string(s.begin(), s.end());
}
#endif
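
With these overloads in place, call sites look the same under both standards. A minimal usage sketch (assuming the from_u8string overloads above; the variable name is illustrative):

#include <string>

int main() {
    // C++20: the u8 literal is char8_t-based and selects the u8string overload.
    // C++17: it is char-based and selects the std::string overload.
    std::string myString = from_u8string(u8"●");
}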
Fabio Fracassi
  • This paper is very enlightening and I'm accepting this answer because it really digs into the crux of the issue; it was hard to choose, since both answers were very helpful! – M2tM Jul 01 '19 at 17:50
  • Hm. Should at least also use `std::string_view` to reduce the carnage in some cases. Even though it adds more functions. – Deduplicator Apr 24 '21 at 11:56
  • In my case, it seems the easiest option is to just remove all uses of `u8"` and assume that all `std::string` are encoded in utf8. – MasterHD Jul 10 '22 at 09:01
24

Should I be creating a new utf8string?

No, C++20 adds std::u8string. However, I would recommend using std::string instead, because char8_t is poorly supported in the standard and not supported by any system APIs at all (and will likely never be, for compatibility reasons). On most platforms normal char strings are already UTF-8, and on Windows with MSVC you can compile with /utf-8, which gives you portable Unicode support on the major operating systems.

For example, you cannot even write a Hello World program using u8 strings in C++20 (https://godbolt.org/z/E6rvj5):

std::cout << u8"Hello, world!\n"; // won't compile in C++20

On Windows with MSVC pre-C++20, the situation is even worse because u8 strings may be silently corrupted. For example:

std::cout << "Привет, мир!\n";

will produce valid UTF-8 that may or may not be displayed in the console, depending on its current code page, while

std::cout << u8"Привет, мир!\n";

will almost definitely give you an invalid result such as ╨а╤Я╨б╨В╨а╤С╨а╨Ж╨а┬╡╨бтАЪ, ╨а╤Ш╨а╤С╨б╨В!.
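
For completeness, a minimal sketch of the recommendation above: keep plain char strings in UTF-8, build with /utf-8, and switch the Windows console to UTF-8 at startup (the _WIN32 guard and the SetConsoleOutputCP call are only needed on Windows; other major platforms default to UTF-8):

#include <iostream>
#ifdef _WIN32
#include <windows.h>
#endif

int main() {
#ifdef _WIN32
    // Make the console interpret char output as UTF-8.
    SetConsoleOutputCP(CP_UTF8);
#endif
    // Stored as UTF-8 in the binary when compiled with /utf-8.
    std::cout << "Привет, мир!\n";
}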

vitaut
  • The statement that MSVC silently corrupts strings is not accurate. Rather, there are scenarios in which [mojibake](https://en.wikipedia.org/wiki/Mojibake) can lead to surprising results. By default, MSVC uses the Active Code Page (ACP; e.g., Windows-1252) as the encoding of source files. Compilation of a UTF-8 source file without the `/source-charset:utf-8` option will cause literals to be (incorrectly) converted from the ACP to the target encoding. Further, the Windows console (not MSVC) will interpret output according to its encoding (e.g., CP437) producing results like @vitaut indicated. – Tom Honermann Dec 31 '20 at 02:32
  • The encoding confusion that produces the results @vitaut indicated is the reason that the `wchar_t`, `char8_t`, `char16_t`, and `char32_t` formatted output inserters are deleted in C++20. – Tom Honermann Dec 31 '20 at 02:33
  • Windows 10 console now has virtual terminal support for UTF-8 output (and other things like ANSI escape sequences). It's not 100% perfect yet, but it's quite usable and still improving. For now, programs must explicitly opt-in for that functionality or they'll be stuck with the code page scheme. – Adrian McCarthy Mar 03 '21 at 19:50
  • Is it a problem to write `std::cout << u8"…"` after a call to `SetConsoleOutputCP(CP_UTF8)`? That should be safe, right? (I mean pre C++20 of course) – Martini Bianco Nov 16 '21 at 16:03
  • @MartiniBianco There's a lot more to it, too much to go over in a comment. But in general: it depends which terminal the user is using. On the legacy console, even in UTF-8 mode (which still wants wide strings and wide APIs, yes that's right, read the first caution [here](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setmode?view=msvc-170#remarks)), it won't support multi-code-point characters. So you are better off with a traditional UTF-16 wide string, which supports more characters imho. – scx Dec 22 '21 at 21:25
  • `SetConsoleOutputCP` can help for `char` strings but not for `char8_t`, because the literal is "corrupted" by the compiler. – vitaut Mar 30 '23 at 16:40
23

Should I be creating a new utf8string?

No, it's already there. P0482 proposes not only char8_t, but also a new specialization of std::basic_string for char8_t character types, named std::u8string. So this already compiles with clang and libc++ from trunk:

const std::u8string str = u8"●";

The fact that std::string construction from a u8-literal breaks is unfortunate. From the proposal:

This proposal does not specify any backward compatibility features other than to retain interfaces that it deprecates. The author believes such features are necessary, but that a single set of such features would unnecessarily compromise the goals of this proposal. Rather, the expectation is that implementations will provide options to enable more fine grained compatibility features.

But I guess most initializations like the one above should be grep-able or fixable by automatic clang tooling.
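
For illustration, the mechanical fix such tooling would apply might look like this (a sketch; the reinterpret_cast variant keeps std::string at the cost of an explicit cast):

// C++17 and earlier:
// std::string str = u8"●";

// C++20, option 1: switch to the new string type.
std::u8string str = u8"●";

// C++20, option 2: keep std::string and cast explicitly.
std::string str2(reinterpret_cast<const char*>(u8"●"));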

lubgr
  • Oh, modern C++. What are you like. smh – Lightness Races in Orbit Jul 01 '19 at 09:59
  • *"it's already there"* - I wouldn't be so optimistic. Even though `u8string` is supposed to deal with UTF-8 chars exclusively, it still treats them as an array of bytes rather than a sequence of symbols. One must reimplement indexing and other per-symbol operations or use some third-party string. So `u8string` brings almost no benefit over regular `string`, especially if UTF-8 is used as the char string encoding. – user7860670 Jul 01 '19 at 09:59
  • @VTT Good point, and I do agree. "It's already there" was meant w.r.t. fixing the `std::string = u8"..."` issue. – lubgr Jul 01 '19 at 10:02
  • Note that it's also possible to let the type of the template be deduced from the literal: `std::basic_string str = u8"●"`. This works in both C++17 and in C++20, but resolves to a different type in each. – eerorika Jul 01 '19 at 10:32
  • @eerorika That's nice! Though to really harvest that level of flexibility, all functions dealing with standard strings would necessarily be templates depending on `Char` type... – lubgr Jul 01 '19 at 10:36
  • Thank you so much for your answer! This really seems like an oversight and feels like a rush job. – M2tM Jul 01 '19 at 17:51
  • It was incredibly important to get this in, and any proposal that was bigger than this would have been even harder to get through. Given our track record of actually breaking backward compatibility, having this is a small miracle. With this building block SG16 (the Unicode/text study group) has a basis to stand on. – Fabio Fracassi Jul 01 '19 at 22:23
  • Thanks @fabio_fracassi I was able to work around the issue in the short term with a basic reinterpret_cast to const char* to achieve backward compatibility. Thankfully I already had a U8_CHAR_STR macro, so although I was using this in 100 places I only had one point to edit. – M2tM Jul 02 '19 at 18:28
  • I'm hoping a comprehensive built-in set of u8 utils makes it in someday. – M2tM Jul 02 '19 at 18:28
  • The revision of P0482 linked in this answer is the initial revision. The revision accepted for C++20 is [P0482R6](http://wg21.link/p0482) and it replaced the quoted text with the following: `This proposal does not specify any backward compatibility features other than to retain interfaces that it deprecates. The author believes such features are necessary, but that a single set of such features would unnecessarily compromise the goals of this proposal. Rather, the expectation is that implementations will provide options to enable more fine grained compatibility features.` – Tom Honermann Aug 12 '19 at 14:45
  • @TomHonermann Thanks! I updated the citation and the link, too. – lubgr Aug 12 '19 at 15:06
3

It currently looks like 'UTF-8 everywhere' advocates have been thrown under the bus, with C++20 offering yet another flawed, incomplete option to consider when deciding how to deal with character encoding for portable code. char8_t further muddies some already very dirty water. The best I've been able to come up with as a stopgap, using the MSVC option "Preview - Features from the Latest C++ Working Draft (/std:c++latest)", is this...

#if defined(__cpp_char8_t)
template<typename T>
const char* u8Cpp20(T&& t) noexcept
{
#pragma warning (push)
#pragma warning (disable: 26490) // C26490: don't use reinterpret_cast
   return reinterpret_cast<const char*>(t);
#pragma warning (pop)
}
   #define U8(x) u8Cpp20(u8##x)
#else
   #define U8(x) u8##x
#endif

It is ugly, inefficient and annoying. But it allows replacing every u8"…" literal with U8("…") in legacy 'UTF-8 everywhere' code, as sketched below. I plan to shun char8_t until the offering is more coherent and complete (or forever). We should wait and see what C++20 finally settles on. At the moment char8_t is a huge disappointment.
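
A usage sketch (assuming the U8 macro above; variable names are illustrative):

#include <iostream>
#include <string>
// ... U8 macro as defined above ...

int main() {
    std::string bullet = U8("●");      // same call site under C++17 and C++20
    std::cout << U8("Привет, мир!\n"); // U8(...) yields a plain char string
}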

If anyone's interested, I've posted an open-source example of my own 'UTF-8 everywhere' response on GitHub (for Visual Studio Community): https://github.com/JackHeeley/App3Dev

2

Another way to use u8 literals as const char* would be a user-defined literal (see https://en.cppreference.com/w/cpp/language/user_literal):

std::string operator"" S(const char8_t* str, std::size_t) {
    return reinterpret_cast< const char* >(str);
}
char const* operator"" C(const char8_t* str, std::size_t) {
    return reinterpret_cast< const char* >(str);
}

Usage: the literals can then be used like this:

std::string myString = u8"●"S;


SetConsoleOutputCP(CP_UTF8);
std::cout << u8"Привет, мир!"C << std::endl;

Explanation

The code above defines two user-defined literals u8"…"S and u8"…"C (remember: a u8"…" literal in C++20 is of type const char8_t*). The S literal creates a std::string and the C literal creates a const char*.

That means all literals of the form u8"…"C can be used like "…" literals, while all literals of the form u8"…"S can be used like "…"s literals.

PS: I'm not sure if it is allowed to define literal suffixes that do not start with an underscore "_". The code ran without a problem when I tried it in Visual Studio, but all examples on cppreference use an underscore.

Martini Bianco
0

It may not be convenient, but you can use this: (const char*)u8"こんにちは"

Or write two overloaded functions, one taking const char* and one taking const char8_t*, as sketched below:
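
A sketch of that overload approach (the function name print and its use of fputs are illustrative, not part of the answer):

#include <cstdio>

void print(const char* s) { std::fputs(s, stdout); }

#if defined(__cpp_char8_t)
// C++20: forward char8_t strings to the char overload.
void print(const char8_t* s) { print(reinterpret_cast<const char*>(s)); }
#endif

int main() {
    print(u8"こんにちは\n"); // selects the right overload in both C++17 and C++20
}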

JAMSHAID
NiceL