char8_t and utf8everywhere: How to convert to const char* APIs without invoking undefined behaviour?

Question

The question "Is C++20 'char8_t' the same as our old 'char'?" is now some years old.

I would like to know what the recommended way to handle conversion between char8_t and char is right now. boost::nowide (1.80.0) does not yet understand char8_t, nor (AFAIK) does boost::locale.

As Tom Honermann noted:

reinterpret_cast<const char   *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text");   // Undefined behavior.

So: how do I interact with APIs that only accept const char* or const wchar_t* (think Win32 API) if my application's "default" string type is std::u8string? The recommendation seems to be https://utf8everywhere.org/.

Suppose I convert between std::u8string and std::string like this:

std::u8string convert(std::string str)
{
    return std::u8string(reinterpret_cast<const char8_t*>(str.data()), str.size());
}
std::string convert(std::u8string str)
{
    return std::string(reinterpret_cast<const char*>(str.data()), str.size());
}

This would invoke the same UB that Tom Honermann mentioned. I would use these functions whenever I talk to the Win32 API or any other API that takes a const char* or returns one. I could route all conversions through boost::nowide, but in the end I get a const char* back from boost::nowide::narrow() that I still need to cast.

Is the current recommendation to just stay at char and ignore char8_t?

BTW when talking with the Win32 API in general you don't want to reinterpret an UTF-8 string and pass it to ANSI APIs _unless you are 100% sure that the thread codepage is set to UTF-8_ (which wasn't supported until some Windows 10 version); the safe way is to convert them to wide strings and then only call W versions of Win32 APIs. — Matteo Italia, Nov 07 '22 at 09:29
There are a lot other libraries that expect `const char*` or give it back. Win32 is just an example. — schorsch_76, Nov 07 '22 at 09:41
What is the point of these `reinterpret_cast`? You are copying the string data in your conversion functions anyway, so just pass `std::begin(str)` and `std::end(str)` as an iterator range to the constructors. — user17732522, Nov 07 '22 at 09:50
@schorsch_76: the point is that for each and every library that accepts `char *` you have to check if they are encoding agnostic or if they actually expect them to be in UTF-8 or in "local encoding"; in the first two cases you can pass them without conversion, in the latter case you are in for trouble. — Matteo Italia, Nov 07 '22 at 09:54
char8_t and char are distinct types in C++20 mode. They are not implicitly convertible. — schorsch_76, Nov 07 '22 at 09:55
@schorsch_76 They are both integral types and so implicit conversion from one to the other is possible. (This does not apply to pointers to these types obviously.) In both functions simply `return {std::begin(str), std::end(str)};` will work fine. — user17732522, Nov 07 '22 at 09:59
But as pointed out by the other comments, while `char8_t`/`std::u8string` implies that the data is UTF-8 encoded, `char`/`std::string` make no claim on the encoding and so one has to manually verify whether UTF-8 is what the API expects (Non-unicode encodings are common.). Otherwise one has to re-encode the string appropriately for the API. For `wchar_t`/`std::wstring` this will always be the case. They (at least in practice) can't be UTF-8 because `wchar_t` is larger than a UTF-8 code unit. (Typically it is UTF-16 or UTF-32 depending on platform.) — user17732522, Nov 07 '22 at 10:05

Nicol Bolas · Accepted Answer · 2023-01-28T14:25:02.010

-1

This would invoke the same UB that Tom Honermann mentioned.

As pointed out in the post you referred to, UB only happens when you cast from a char* to a char8_t*. The other direction is fine.

If you are given a char* which is encoded in UTF-8 (and you want to avoid the UB of just doing the cast for some reason), you can use std::transform to convert each char to a char8_t by value:

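#include <algorithm>
#include <string>

std::u8string convert(const std::string& str)
{
    // basic_string has no size-only constructor, so supply a fill character.
    std::u8string ret(str.size(), u8'\0');
    // Convert each char by value; no pointer cast is involved.
    std::ranges::transform(str, ret.begin(), [](char c) { return static_cast<char8_t>(c); });
    return ret;
}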

C++23's ranges::to will make using a named return variable unnecessary.

For dealing with wchar_t interfaces (which you shouldn't have to, since nowadays UTF-8 support exists through narrow character interfaces on Windows), you'll have to do an actual UTF-8->UTF-16 conversion. Which you would have had to do anyway.

edited Jan 28 '23 at 14:25

answered Nov 07 '22 at 14:43

Nicol Bolas


Tom answered in the other thread (https://stackoverflow.com/questions/57402464/is-c20-char8-t-the-same-as-our-old-char): "@schorsch_76, the C and C++ standards currently lack interfaces for converting between the char (locale-based) encoding and UTF-8. Work on such interfaces is underway. Low level conversion interfaces for C are being pursued by WG14 via N3031. Once available, higher level interfaces like those specified in P1629 will hopefully be adopted by WG21 for C++. In the meantime, utilities like iconv or other conversion libraries are required." – Tom Honermann yesterday – schorsch_76 Nov 09 '22 at 08:40
@schorsch_76: Sure, but your question is built on the assumption that you've been given a `char*` array that *is encoded* in UTF-8. You're just trying to make the type match the data. A cast would never be able to do narrow-to-UTF-8 trans-coding, so UB or not, it wouldn't be viable. – Nicol Bolas Nov 09 '22 at 14:31
First, the cast char8_t(c) will not convert to UTF-8. The cast is really dangerous if invalid UTF-8 code sequences are present in the original char sequence! Second: the advice to use the "narrow character interfaces on Windows" is total misinformation. On Windows you must use the Unicode interface. That is the only valid advice from Microsoft itself. The outdated ANSI interface uses code pages and is a compatibility relic of the Windows 95 OS line, which was merged into the NT OS line with Windows XP. – TeaAge Solutions Jan 28 '23 at 07:11
@TeaAgeSolutions: "*The cast is really dangerous if invalid utf-8 code sequences are present in the original char sequence!*" You may not have noticed the statement, "If you are given a char* which is encoded in UTF-8". So there is no danger. "*On Windows you must use the Unicode interface. That is the only valid advice by Microsoft itself.*" [Incorrect since May 2019.](https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page) Windows's "ANSI" interfaces can be set to accept UTF-8. – Nicol Bolas Jan 28 '23 at 14:24
@NicolBolas: That's wrong. "UTF-8 code page" is a beta feature. Also, from a technical point of view, it is not a real code page internally but an emulation. Also, the legacy ANSI API still has the same restrictions on, e.g., file naming and maximum file path length: https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew You must use the Unicode API to get the normal set of features. – TeaAge Solutions Jan 29 '23 at 07:20

score -1 · Answer 2 · answered Jan 28 '23 at 08:29

Personally, I think all the char8_t stuff in C++ is practically unusable!

Given the current standard and the current state of OS support, I would recommend avoiding it if possible.

But that is not all. There is more to criticize:

Unfortunately, the C++ standard deprecates its own conversion support before offering a replacement! For example, std::filesystem's support for UTF-8 encoded standard strings (std::string, not std::u8string) via std::filesystem::u8path is deprecated. As a result, even using a UTF-8 encoded std::string is a pain, because you must always convert back and forth!

As for your questions: it depends on what you want to do. If you want a std::string which is UTF-8 encoded but you only have a std::u8string, then you can simply do the following (no reinterpret_cast needed):

std::string convert( std::u8string str )
{
    return std::string(str.begin(), str.end());
}

But here I would personally expect the standard to offer a move constructor in std::string taking a std::u8string. Otherwise you must always make a copy, with an extra allocation, of unchanged data. Unfortunately the standard does not offer such simple things, which forces users into inconvenient and expensive workarounds.

The same applies in the other direction: if you have a std::string and have verified 100% that it is valid UTF-8, you can convert it directly:

std::u8string  convert( std::string str )
{
    return std::u8string( str.begin(), str.end() );
}

While writing this long answer, I realized that the situation is even worse than I thought when it comes to conversion! If you need to do a real conversion of the encoding, it turns out that std::u8string is not supported at all.

The only way possible (that is my research result so far) is to use std::string as the data holder for the conversion, since the available routines are working on char and NOT on char8_t!

So, for the conversion from std::string to std::u8string you must do the following:

Use std::mbrtoc16 or std::mbrtoc32 to convert narrow chars to either UTF-16 or UTF-32.
Use std::codecvt_utf8 to produce a UTF-8 encoded std::string.
Finally use the routine above to convert from UTF-8 encoded std::string to std::u8string.

For the other way round from std::u8string to std::string you must do the following:

Use the routine above to create a UTF-8 encoded std::string.
Use std::codecvt_utf8 to create a UTF-16/32 string.
And finally use std::c16rtomb or std::c32rtomb to produce a narrow encoded std::string.

But guess what? The codecvt routines are deprecated without a replacement...

So, personally, I would recommend using the Windows API for the conversion and using only std::string (or, on Windows, std::wstring). Usually it is only on Windows that std::string / char is encoded with a Windows code page; everywhere else you can normally expect UTF-8 (except maybe on mainframes and some very rare old systems).

The conclusion can only be: Don't mess around with char8_t and std::u8string at all. It is practically unusable.

"*For example, the support in std::filesystem by using an utf-8 encoded standard string (not u8string) is deprecated (std::filesystem::u8path).*" ... yes. Because they want you to use a `char8_t`-based string. That is, you aren't supposed to have a "utf-8 encoded standard string (not u8string)". Saying that something is unusable because there are issues when you *don't use it* is tautological and nonsensical. — Nicol Bolas, Jan 28 '23 at 14:29
Since std::string is allowed to be UTF-8, I just want to use it normally as UTF-8. All APIs expect std::string. char8_t is a nightmare and a really bad design. There is no need to introduce artificial types which are not interchangeable but have the exact same underlying representation. If and only if ONE char8_t represented ONE Unicode code point as UTF-8 would there be a reason to go the hard way. But one char8_t is just one byte regardless of the text character. Why take the freedom of choice and the simplicity out of it? If std::string is bad, then it must be deprecated and removed. — TeaAge Solutions, Jan 29 '23 at 06:49
