
If I have a string:

std::string s = u8"你好";

and in C++20,

std::u8string s = u8"你好";

how will std::u8string be different from std::string?

user963241
  • It uses UTF-8 instead of some system-dependent encoding (which may or may not be UTF-8). – Shawn Jun 03 '19 at 03:31
  • Guess the first one won't even compile in C++20, since a `u8""` literal is of type `const char8_t[N]` and cannot be converted to `std::string`. – halfelf Jun 03 '19 at 03:37
  • From [cppreference](https://en.cppreference.com/w/cpp/string/basic_string), `std::string` is defined as `std::basic_string<char>`, while `std::u8string` is defined as `std::basic_string<char8_t>`. So, I guess the question is `char` vs `char8_t`. And the motivation, I think, you could find here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html – Amadeus Jun 03 '19 at 03:41
  • That means `std::string` is implementation-defined and `std::u8string` is not? – user963241 Jun 03 '19 at 04:04
  • Correct. See [Is the u8 string literal necessary in C++11](https://stackoverflow.com/q/13444930/995714) – phuclv Jun 03 '19 at 07:46

1 Answer


Since the difference between u8string and string is that one is templated on char8_t and the other on char, the real question is what is the difference between using char8_t-based strings vs. char-based strings.
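Concretely, both definitions can be checked directly (a minimal snippet; both assertions follow from the standard typedefs):

```cpp
#include <string>
#include <type_traits>

static_assert(std::is_same_v<std::string,   std::basic_string<char>>);
static_assert(std::is_same_v<std::u8string, std::basic_string<char8_t>>);  // C++20
```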

It really comes down to this: type-based encoding.

Any char-based string (char*, char[], std::string, etc.) may be encoded in UTF-8. Then again, it may not be. You could develop your code under the assumption that every char* equivalent will be UTF-8 encoded, write a u8 in front of every string literal, and/or otherwise ensure they're properly encoded. But:

  1. Other people's code may not agree. So you can't use any library that might return char*s that don't use UTF-8 encoding.

  2. You might accidentally violate your own precepts. After all, char not_utf8[] = "你好"; is conditionally supported C++. The encoding of that char[] will be the compiler's narrow encoding... whatever that is. It may be UTF-8 on some compilers and something else on others.

  3. You can't tell other people's code (or even other people on your team) that this is what you're doing. That is, your API cannot declare that a particular char* is UTF-8-encoded. The user has to assume it, or read it in your documentation, rather than see it in the code (see the sketch after this list).
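A minimal sketch of that last point, with hypothetical function names: the char-based signature carries no encoding information at all, while the char8_t-based one states the contract in the type.

```cpp
#include <string>

// Pre-C++20 style: nothing in the signature says "UTF-8". The caller
// has to read the documentation (and trust it) to know the encoding.
void set_window_title(const std::string& title);

// C++20 style: the parameter type itself declares the encoding.
void set_window_title_utf8(const std::u8string& title);
```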

Note that none of these problems exist for users of UTF-16 or UTF-32. If you use a char16_t-based string, all of these problems go away. If other people's code returns a char16_t string, you know what they're doing. If they return something else, then you know that those things probably aren't UTF-16. Your UTF-16-based code can interop with theirs. If you write an API that returns a char16_t-based string, everyone using your code can see from the type of the string what encoding it is. And this is guaranteed to be a compile error: char16_t not_utf16[] = "你好";
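For illustration, a short sketch of how the literal prefixes and types line up:

```cpp
// u"" literals are always UTF-16, so the type documents the encoding:
const char16_t ok[] = u"你好";

// A plain "" literal uses the implementation-defined narrow encoding,
// so the mismatch is caught at compile time:
// const char16_t not_utf16[] = "你好";   // error: wrong element type
```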

Now yes, there is no guarantee of any of these things. Any particular char16_t string could hold any values, even ones that are illegal in UTF-16. But char16_t is a type for which the default assumption is a specific encoding. Given that, if you present a string of this type that isn't UTF-16 encoded, it would not be unreasonable to treat that as a mistake (or perfidy) on the user's part: a contract violation.
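A minimal illustration of what that deliberate effort looks like: you have to put the bad code unit there yourself; it won't come from ordinary text.

```cpp
// A lone low surrogate is ill-formed UTF-16, but nothing stops you
// from storing it on purpose:
char16_t bogus[] = { 0xDC00, 0 };
```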

We can see how C++ has been impacted by lacking similar, type-based facilities for UTF-8. Consider filesystem::path. It can take strings in any Unicode encoding. For UTF-16/32, path's constructor takes char16/32_t-based strings. But you cannot pass a UTF-8 string to path's constructor; the char-based constructor assumes that the encoding is the implementation-defined narrow encoding, not UTF-8. So instead, you have to employ filesystem::u8path, which is a separate function that returns a path, constructed from a UTF-8-encoded string.

What's worse is that if you try to pass a UTF-8 encoded char-based string to path's constructor... it compiles fine. Despite being at best non-portable, it may just appear to work.
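A sketch of the three cases, assuming C++20 (the file name is just an example):

```cpp
#include <filesystem>
namespace fs = std::filesystem;

int main() {
    // char-based constructor: interpreted in the implementation-defined
    // narrow encoding. Compiles fine even if you *meant* UTF-8.
    fs::path p1("r\xC3\xA9sum\xC3\xA9.txt");

    // Pre-C++20: u8path is how you say "these chars are UTF-8".
    fs::path p2 = fs::u8path("r\xC3\xA9sum\xC3\xA9.txt");

    // C++20: the char8_t-based overload makes the encoding part of the type.
    fs::path p3(u8"résumé.txt");
}
```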

char8_t, and all of its accoutrements like u8string, exist to give UTF-8 users the same power that the other UTF encodings get. In C++20, filesystem::path gains overloads for char8_t-based strings, and u8path is deprecated.

And, as an added bonus, char8_t doesn't have the special aliasing rules that char has: char may legally view the bytes of any object, but char8_t may not. So an API that takes char8_t-based strings is certainly an API that takes a character array, rather than an arbitrary byte array.
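A sketch of the aliasing point: char may view the raw bytes of any object, so a char* parameter might not be text at all, whereas char8_t has no such carve-out.

```cpp
#include <cstddef>
#include <cstdio>

// A perfectly legal use of char that has nothing to do with text:
// char is allowed to alias the object representation of anything.
void dump_bytes(const void* obj, std::size_t n) {
    const char* bytes = static_cast<const char*>(obj);
    for (std::size_t i = 0; i < n; ++i)
        std::printf("%02x ", static_cast<unsigned>(
                                 static_cast<unsigned char>(bytes[i])));
    std::printf("\n");
}

int main() {
    int x = 42;
    dump_bytes(&x, sizeof x);        // char as raw bytes: fine

    const char8_t* text = u8"你好";  // char8_t: unambiguously UTF-8 text
    (void)text;
}
```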

Nicol Bolas
  • Undefined behavior: `char8_t utf8 = u8'你';` -- OK: `char16_t utf16 = u'你';` -- OK: `char32_t utf32 = U'你';` -- Why is `char8_t` based on `unsigned char`? – Chef Gladiator Apr 01 '20 at 19:55
  • @ChefGladiator: As opposed to what? `char8_t` is intended to be a UTF-8 code *unit*, not a Unicode codepoint. And UTF-8 code units are 8 bits in size, and `unsigned char` is required to be at least that big. And there are plenty of codepoints that would fail with `char16_t` too. – Nicol Bolas Apr 01 '20 at 21:19
  • Why are you asking me? You are selling `char8_t`, not me. As far as I am concerned, I see no reason to use C++20 `char8_t`. It seems like a slow train crash ... 8 years long. – Chef Gladiator Apr 01 '20 at 21:53
  • @ChefGladiator: I'm not "selling" anything. I'm explaining what the type does and is for. The fact that a type could be used in a way that it isn't meant to be used does not mean it cannot be used in the way it is actually intended to be used. `char8_t` is meant to be a UTF-8 code unit, just like `char16_t` is a UTF-16 code unit. What else could it be? If you don't like the type, that's your prerogative, but that doesn't change what the type does and how it is intended to be used. – Nicol Bolas Apr 01 '20 at 22:08
  • I beg to differ, Nicol. It appears you are selling WG21 decision-making on encoding and C++. It just appears to be a mess when compared to modern languages. There was absolutely no reason to move away from `char` rather than standardizing the execution character set as UTF-8. The result is a mountain of technical debt. Sorry ... – Chef Gladiator Apr 01 '20 at 23:59
  • @ChefGladiator: OK, how exactly would I factually explain what `char8_t` and its attendant typedefs are without "selling WG21 decision making"? I'm trying to provide facts, not opinions like whether it would have been a better idea to just force everyone to make `char` UTF-8 (though I fail to see how that would allow `char utf8 = '你'` to be any more legitimate than the `char8_t` version, so that criticism seems rather off base). My point is that I'm pushing *facts*; you're trying to push an agenda. – Nicol Bolas Apr 02 '20 at 01:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/210773/discussion-between-nicol-bolas-and-chef-gladiator). – Nicol Bolas Apr 02 '20 at 01:38
  • One problem with this reply -- `std::u8string`s may, or may not, be UTF-8. `std::string` may, or may not, be UTF-8, so I don't believe there are any guarantees you have now that you didn't have before. – Chris Jefferson Dec 15 '22 at 14:23
  • @ChrisJefferson: It "may not" be UTF-8 in the exact same way that a `char16_t *` "may not" be UTF-16. You *can* create such a string, but it requires *actual effort*. By contrast, it is trivially easy to create a `char*` that isn't required to be UTF-8. By deliberately using the `char8_t` type, you are making a promise about the meaning of the data you're putting into it. You can lie, but that requires way more effort than `char`. – Nicol Bolas Dec 15 '22 at 14:32