Personally, I think all the char8_t stuff in C++ is practically unusable!
Given the current standard and the current OS support, I would recommend avoiding it if possible.
But that is not all. There is more to criticize:
Unfortunately the C++ standard itself deprecates its own conversion support before it offers a replacement!
For example, the std::filesystem support for a UTF-8 encoded standard string (std::string, not std::u8string), namely std::filesystem::u8path, is deprecated since C++20. As a result, even using a UTF-8 encoded std::string is a pain, because you must always convert it from one type to the other and back again!
To your questions: it depends on what you want to do. If you want a std::string which is UTF-8 encoded but you only have a std::u8string, then you can simply do the following (no reinterpret_cast needed):
// Copies the UTF-8 bytes of a std::u8string into a std::string.
std::string convert( const std::u8string& str )
{
    return std::string( str.begin(), str.end() );
}
But here I would personally expect the standard to offer a move constructor in std::string that takes a std::u8string. Otherwise you must always make a copy, with an extra allocation, of data that does not change.
Unfortunately, the standard does not offer such simple things. It forces users to do uncomfortable and expensive stuff.
The same is true in the other direction: if you have a std::string and you have verified 100% that it is valid UTF-8, then you can convert it directly:
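// Copies the bytes of a std::string (which must already be valid UTF-8) into a std::u8string.
std::u8string convert( const std::string& str )
{
    return std::u8string( str.begin(), str.end() );
}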
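As far as I know, the standard library has no ready-made check for the "100% verified" part, so you have to write it yourself or use a library. A minimal sketch of such a check, assuming you want to reject overlong encodings, surrogates and code points above U+10FFFF (the name is_valid_utf8 is made up for illustration):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    // Smallest code point that legitimately needs a sequence of the given length.
    static constexpr char32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }            // plain ASCII byte

        std::size_t len = 0;
        char32_t cp = 0;
        if      ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                           // stray continuation or invalid lead byte

        if (n - i < len) return false;               // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp > 0x10FFFF) return false;                 // outside the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

Only when this returns true is the plain byte copy above safe to treat as a real std::u8string.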
While writing this long answer I realized that it is even worse than I thought when it comes to conversion! If you need to do a real conversion of the encoding, it turns out that std::u8string is not supported at all.
The only possible way (according to my research so far) is to use std::string as the data holder for the conversion, since the available routines work on char and NOT on char8_t!
So, for the conversion from std::string to std::u8string you must do the following:
- Use std::mbrtoc16 or std::mbrtoc32 to convert narrow char to either UTF-16 or UTF-32.
- Use std::codecvt_utf8 to produce a UTF-8 encoded std::string.
- Finally use the routine above to convert from UTF-8 encoded std::string to std::u8string.
For the other way round, from std::u8string to std::string, you must do the following:
- Use the routine above to create a UTF-8 encoded std::string.
- Use std::codecvt_utf8 to create a UTF-16/32 string.
- And finally use std::c16rtomb or std::c32rtomb to produce a narrow encoded std::string.
But guess what? The codecvt routines are deprecated without a replacement...
So, personally, I would recommend using the Windows API for the conversion and using std::string only (or std::wstring on Windows). Usually it is only on Windows that std::string / char is encoded with a Windows code page; everywhere else you can normally expect UTF-8 (except maybe on mainframes and some very rare old systems).
The conclusion can only be: Don't mess around with char8_t and std::u8string at all. It is practically unusable.