I'm in the process of upgrading my code base to C++20 and would like to make use of std::u8string/char8_t. I'm using a 3rd-party library that takes and returns UTF-8 strings in its API. However, it hasn't been updated to C++20 yet and thus takes and returns the UTF-8 strings as regular std::string instead of std::u8string.

Converting std::u8string to std::string is pretty straight-forward, as the u8string's buffer may be accessed through a char* pointer, so

std::u8string u8s = get_data();
std::string s(reinterpret_cast<char const*>(u8s.data()), u8s.size());

is valid code. However, as far as I'm aware char8_t does not have the aliasing exemption that std::byte and char have, thus

std::string s = get_data();
std::u8string u8s(reinterpret_cast<char8_t const*>(s.data()), s.size());

is not valid.

I've resorted to

std::string s = get_data();
std::u8string u8s(s.size(), u8'\0');
std::memcpy(u8s.data(), s.data(), s.size());

for now, but that seems unnecessarily inefficient given that this first initializes the memory to all zeroes before writing the actual data into it.

Is there a way to avoid the initialization to all zeroes or another way to convert between std::string and std::u8string altogether?

Corristo
  • `u8string u8s(s.begin(), s.end());` maybe. I'm assuming there's no problem converting `char` to `char8_t` but I'm not familiar with `char8_t`. – john Sep 24 '20 at 14:44
  • `Converting std::u8string to std::string is pretty straight-forward,` only if text inside `std::string` also uses `UTF-8` encoding. – Marek R Sep 24 '20 at 15:00
  • @MarekR Sure, but that was the assumption I explicitly stated before. – Corristo Sep 24 '20 at 15:02

1 Answer

u8string u8s(s.begin(), s.end()) should work just fine. You don't need the cast. The constructor is templated, and char implicitly converts to char8_t.

The underlying type of char8_t being unsigned char is not a problem even if char is a signed type: the integral conversion to an unsigned type is modular, so each byte keeps its bit pattern.

eerorika
  • Thanks. For some reason I assumed that `char8_t` and `char` aren't implicitly convertible to one-another just because the pointers `char8_t*` and `char*` aren't... – Corristo Sep 24 '20 at 15:04