
C++20 added char8_t, which is (I believe) designed to help support UTF-8 better.

String constants of the form u8"abc" are required by the standard to be valid UTF-8 in a char8_t[] array. These constants can also be turned into std::u8strings.

However, I can find nothing in the C++ standard which suggests that a std::u8string either must, or even should, contain a UTF-8 string. Is there in practice any difference between a std::string and std::u8string in terms of UTF-8 support?

Remy Lebeau
Chris Jefferson
  • [`char8_t`](https://en.cppreference.com/w/cpp/language/types#Character_types) - type for UTF-8 character representation, required to be large enough to represent any UTF-8 code unit (8 bits). It has the same size, signedness, and alignment as `unsigned char` (and therefore, the same size and alignment as `char` and `signed char`), but is a distinct type. – 273K Nov 27 '22 at 22:36
  • `std::u8string` is `std::basic_string<char8_t>`. – 273K Nov 27 '22 at 22:37
  • The difference is that `std::string` is `std::basic_string<char>`; `char` can be a signed or an unsigned type, whereas `char8_t` is always unsigned. – 273K Nov 27 '22 at 22:39
  • "_String constants of the form u8"abc" are required by the standard to be valid UTF-8_": I don't see why this should be true. Malformed UTF-8 sequences are allowed as far as I can tell, e.g. `u8"\xff"`. – user17732522 Nov 27 '22 at 23:04
  • Does this answer your question? [how std::u8string will be different from std::string?](https://stackoverflow.com/questions/56420790/how-stdu8string-will-be-different-from-stdstring) – Richard Critten Nov 28 '22 at 00:22
  • On some systems a std::string might be encoded in EBCDIC, which makes quite a difference. – BoP Nov 28 '22 at 01:25
  • @RichardCritten, that question seems to cover a much broader area, and fails to give a specific factual answer to the question I ask, but maybe it should answer this question clearly as part of its answer. – Chris Jefferson Nov 28 '22 at 06:20

1 Answer


No, C++ does not require you to store valid UTF-8 in a std::u8string. From the compiler's perspective, std::u8string has exactly the same semantics as std::string.

But "in practice" you can expect functions that take a std::u8string argument to expect that string to be valid UTF-8. Even if they accept invalid UTF-8, they will never expect your string to be Latin-1 encoded. The same cannot be said for std::string.

Chronial
  • That is what I thought (I will just give people a brief while to pop up with new information before accepting your answer as valid in case we both missed something). – Chris Jefferson Nov 27 '22 at 23:10