
Why does C++11 provide std::u16string and std::u32string but not std::u8string? Do we need to implement UTF-8 handling ourselves or use additional libraries?

Sergio
  • Think again about what UTF-8 is... Is it not a multi-***byte*** encoding? Now, what data type in C++ typically represents a byte? Is it not `char`? And what do we have that is a string of `char`? It is `std::string`. So no specific `std::u8string` is really needed. – Some programmer dude Mar 20 '17 at 09:51
  • `std::wstring` uses `wchar_t`, and that type's size is underspecified (16 bits on some platforms, 32 on others). `u16string` and `u32string` patch that hole. `std::string` is already a string of `char`, and a `char` is a byte, a.k.a. the smallest memory unit your C++ program can address. So either `u8string` could not (efficiently) exist, or it would be identical to `std::string`, on a given platform (really, both), assuming `CHAR_BIT >= 8`. – Yakk - Adam Nevraumont Mar 20 '17 at 17:20
  • `std::u16string` and `std::u32string` exist because C++11 added new data types for them, `char16_t` and `char32_t`, respectively. No new data type was added for handling UTF-8 (just a new `u8` prefix for literals). Historically, `std::string` has always been used for 8-bit string data, and that has not changed. But if you really want a `u8string` type, there is nothing stopping you from declaring your own `typedef`/`using` alias for it. – Remy Lebeau Mar 21 '17 at 00:12

3 Answers


C++20 adds char8_t and std::u8string. According to the proposal ([P0482](https://wg21.link/p0482)), the rationale is:

UTF-8 is the only text encoding mandated to be supported by the C++ standard for which there is no distinct code unit type. Lack of a distinct type for UTF-8 encoded character and string literals prevents the use of overloading and template specialization in interfaces designed for interoperability with encoded text. The inability to infer an encoding for narrow characters and strings limits design possibilities and hinders the production of elegant interfaces that work seamlessly in generic code. Library authors must choose to limit encoding support, design interfaces that require users to explicitly specify encodings, or provide distinct interfaces for, at least, the implementation defined execution and UTF-8 encodings.

Whether char is a signed or unsigned type is implementation defined and implementations that use an 8-bit signed char are at a disadvantage with respect to working with UTF-8 encoded text due to the necessity of having to rely on conversions to unsigned types in order to correctly process leading and continuation code units of multi-byte encoded code points.

The lack of a distinct type and the use of a code unit type with a range that does not portably include the full unsigned range of UTF-8 code units presents challenges for working with UTF-8 encoded text that are not present when working with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for a new char8_t fundamental type and related library enhancements intended to remove barriers to working with UTF-8 encoded text and to enable generic interfaces that work with all five of the standard mandated text encodings in a consistent manner.
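The overloading point from the quote is easy to demonstrate. A short sketch, assuming C++20 (the function name `process` is hypothetical):

    #include <iostream>

    // With a distinct char8_t, overload resolution can route UTF-8 text
    // to an encoding-aware implementation.
    void process(const char*)    { std::cout << "execution encoding\n"; }
    void process(const char8_t*) { std::cout << "known to be UTF-8\n"; }

    int main() {
        process("hello");   // picks the char overload
        process(u8"hello"); // picks the char8_t overload (C++20)
    }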

lz96

Because the C/C++ standard committees don't yet care enough about valid UTF-8 sequences and comparisons. For them `strcmp((char*)utf8, (char*)other)` is enough, even if the two strings would be equal once normalized, or even if one of them is invalid UTF-8.

Nor do they care about proper identifiers, UTF-8 sequences that should be identifiable, like pathnames. For them "Café" is not the same as "Café" when the two have different bytes: "e\x301" vs. "\xe9". For u8ident that is wrong; for u8string it's arguable. At least validity needs to be checked; normalization can be cached, and is a rare case anyway.
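A minimal illustration of that byte-level mismatch (the escapes are the UTF-8 encodings: precomposed é is "\xC3\xA9", while e followed by combining acute U+0301 is "e\xCC\x81"):

    #include <cstdio>
    #include <cstring>

    int main() {
        const char* nfc = "Caf\xC3\xA9";  // "Café", NFC: precomposed U+00E9
        const char* nfd = "Cafe\xCC\x81"; // "Café", NFD: 'e' + combining U+0301
        // Both render identically, but strcmp compares raw bytes.
        std::puts(std::strcmp(nfc, nfd) == 0 ? "equal" : "different"); // prints "different"
    }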

Not even coreutils can do that properly yet, and most filesystems treat names as binary, which is a security risk.

See e.g. https://crashcourse.housegordon.org/coreutils-multibyte-support.html or http://perl11.github.io/blog/foldcase.html

rurban

C++20 adds std::u8string. However, I would recommend using std::string instead, because char8_t is poorly supported in the standard and not supported by any system APIs at all (and likely never will be, for compatibility reasons). On most platforms normal char strings are already UTF-8, and on Windows with MSVC you can compile with /utf-8, which gives you portable Unicode support across major operating systems.
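A sketch of that approach, assuming the source file is saved as UTF-8 and compiled with /utf-8 on MSVC (most other compilers already use UTF-8 as the execution character set):

    #include <cstdio>
    #include <string>

    int main() {
        std::string s = "Привет, мир!\n"; // plain char string holding UTF-8 bytes
        std::fwrite(s.data(), 1, s.size(), stdout); // bytes are written untouched
        // Note: whether a Windows console renders them correctly still depends
        // on its code page (e.g. after SetConsoleOutputCP(CP_UTF8)).
    }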

In addition to poor support in the standard, on Windows with MSVC u8 strings can be silently corrupted. For example:

std::cout << u8"Привет, мир!\n";

will almost certainly give you an invalid result such as ╨а╤Я╨б╨В╨а╤С╨а╨Ж╨а┬╡╨бтАЪ, ╨а╤Ш╨а╤С╨б╨В!.

vitaut
  • The statement that MSVC silently corrupts `u8` strings is not accurate. Rather, there are scenarios in which [mojibake](https://en.wikipedia.org/wiki/Mojibake) can lead to surprising results. By default, MSVC uses the Active Code Page (ACP; e.g., Windows-1252) as the encoding of source files. Compilation of a UTF-8 source file without the `/source-charset:utf-8` option will cause `u8` literals to be (incorrectly) converted from the ACP to UTF-8. Further, the Windows console (not MSVC) will interpret output according to its encoding (e.g., CP437), producing results like @vitaut indicated. – Tom Honermann Dec 31 '20 at 02:15
  • The encoding confusion that produces the results @vitaut indicated is the reason that the `wchar_t`, `char8_t`, `char16_t`, and `char32_t` formatted output inserters are deleted. [ostream.general](http://eel.is/c++draft/ostream.general). – Tom Honermann Dec 31 '20 at 02:22
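A small sketch of the C++20 behavior described in the comments above: the `char8_t` inserter is deleted, so the problematic line no longer compiles, and explicitly reinterpreting the bytes is one way to pass them through:

    #include <iostream>

    int main() {
        // std::cout << u8"Привет, мир!\n"; // ill-formed in C++20: the char8_t
        //                                  // inserter is explicitly deleted
        std::cout << reinterpret_cast<const char*>(u8"Привет, мир!\n"); // bytes pass through as-is
    }

How those bytes render still depends on the console's encoding, as noted above.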