I've been reading links such as this question and, of course, this question on preparing for the upcoming "utf8" char type `char8_t` and its corresponding string type in C++20, and can say, up to a point, that it's about time. Also that it's a mess.
Feel free to correct me where I'm wrong:
- C++, in any Standard, has no means to specify that the source code has a given text encoding (something like Python's `# encoding:...` metadata), nor which Standards it can be compiled under (like, say, `#!/bin/env g++ -std=c++14`).
- Up until C++11, there was also no way to specify that any given string literal would have a given encoding: the compiler was free to reparse a UTF-8 string literal into, say, UTF-16 or even EBCDIC if it so desired.
- C++11 introduces `u"text"` and `U"text"` and associated char types to produce UTF-16 and UTF-32 encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.
- C++11 also introduces `u8"text"` for producing a UTF-8 encoded string... but does not even introduce either a proper UTF-8 char type or string type (that's what `char8_t` is intended to be in C++20?), so it's even more useless than the above.
- Because of all this, when `char8_t` is finally introduced, it breaks lots of code that was intended to be valid, and so far some of the remediations sought include disabling `char8_t` behaviour altogether.
- Even then, there's no readily available tooling (as in: not the same crap-tier interface as `<random>`) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even `codecvt` seems to have been dropped.
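The breakage described in the list above can be observed by inspecting the element type of a `u8` literal. What follows is a minimal sketch, assuming a compiler that defines the `__cpp_char8_t` feature-test macro whenever `char8_t` is enabled; `u8_elem` and `u8_literal_is_plain_char` are names made up for this illustration.

```cpp
#include <cassert>
#include <type_traits>

// Element type of a u8 literal, with reference and cv-qualifiers stripped.
// (u8_elem is a hypothetical name, not a standard one.)
typedef std::remove_cv<
    std::remove_reference<decltype(u8"x"[0])>::type
>::type u8_elem;

#if defined(__cpp_char8_t)
// With char8_t enabled, u8 literals are arrays of const char8_t.
static_assert(std::is_same<u8_elem, char8_t>::value,
              "u8 literal elements are char8_t");
// char8_t is unsigned, but it is a distinct type, not an alias.
static_assert(std::is_unsigned<char8_t>::value, "char8_t is unsigned");
static_assert(!std::is_same<char8_t, unsigned char>::value,
              "char8_t is not unsigned char");
static_assert(!std::is_same<char8_t, char>::value,
              "char8_t is not char");
#else
// Without char8_t, u8 literals are arrays of const char.
static_assert(std::is_same<u8_elem, char>::value,
              "u8 literal elements are char");
#endif

// Hypothetical helper: true when u8 literals still decay to const char*.
constexpr bool u8_literal_is_plain_char() {
    return std::is_same<u8_elem, char>::value;
}
```

Code such as `const char* p = u8"text";` compiles only where `u8_literal_is_plain_char()` is true. The "disable it altogether" remediation corresponds, as far as I know, to options such as `-fno-char8_t` (GCC, Clang) or `/Zc:char8_t-` (MSVC).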
Given all of the above, I have some questions about why we are in this weird state and whether it will ever get better. Historically, Unicode support has been one of the lowest points of C++.
Similarly, I am wondering how useful a poor-man's emulation of the whole concept would be (disclaimer: I am the maintainer of cxxomfort, and I already backport lots of things. Work needs: the latest MSVC target at the office is MSVC 2012).
- Why did C++ not add `char8_t` at the proper time, when `u8"text"` was introduced, or otherwise delay the introduction of `u8`?
- Alternatively, why wasn't another, non-breaking prefix like `c8"text"` introduced with `char8_t` in C++20, instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more so something that literally breaks the simplest possible case: `cout << prefix"hello world"`.
- Is `char8_t` intended to functionally be (closer to) an alias of `unsigned char` or of `char`?
- If the former, is working up the way to, e.g., `typedef std::basic_string<unsigned char> u8string` a viable emulation strategy? Are there backport / reference implementations available that one can look into before writing my own?
- What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 *for storage only*?
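To make the last question concrete, here is one possible shape for "marked as UTF-8, for storage only" in pre-C++20 code: a thin wrapper over `std::string` that validates on construction. This is only a sketch under my own assumptions; `is_valid_utf8` and `utf8_text` are hypothetical names, not part of any Standard or of cxxomfort, and it deliberately avoids C++11-only features.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if s is well-formed UTF-8. Rejects stray or
// missing continuation bytes, overlong forms, surrogates and code points
// above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;             // continuation or invalid lead byte
        if (len > n - i) return false; // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char bk = static_cast<unsigned char>(s[i + k]);
            if ((bk & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (bk & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;  // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate range
        if (cp > 0x10FFFF) return false;                // beyond Unicode
        i += len;
    }
    return true;
}

// Hypothetical storage-only tag type: holds bytes known to be valid UTF-8.
class utf8_text {
    std::string bytes_;
public:
    utf8_text() {}
    explicit utf8_text(const std::string& s) : bytes_(s) {
        if (!is_valid_utf8(bytes_))
            throw std::invalid_argument("not well-formed UTF-8");
    }
    const std::string& bytes() const { return bytes_; }
};
```

The wrapper exposes no per-character operations on purpose: it only records that the bytes passed validation, which is about as far as "for storage only" needs to go.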
Re: `char8_t` as `unsigned char`, this is more or less what I'm looking at in terms of pseudocode:
// this is here basically only for type-distinctiveness
class char8_t {
unsigned char value;
public:
// intentionally not explicit: implicit conversion from unsigned char is wanted
constexpr char8_t (unsigned char ch = 0x00) noexcept;
operator unsigned char () const noexcept;
// implement all operators to mirror operations on unsigned char
};
// public adapter jic
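// (a free function: 'friend' is only valid inside the class body,
// and the conversion operator already gives access to the value)
unsigned char to_char (char8_t);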
// note we're *not* using our new char-type here
namespace std {
typedef std::basic_string<unsigned char> u8string;
}
// unsure if these two would actually be needed
// (couldn't make a compelling case so far,
// even testing with Windows's broken conhost)
namespace std {
basic_istream<char8_t> u8cin;
basic_ostream<char8_t> u8cout;
}
// we work up operator<<, operator>> and string conversion from there
// adding utf8-validity checks where needed
std::ostream& operator<< (std::ostream&, std::u8string const&);
std::istream& operator>> (std::istream&, std::u8string&);
// likely a macro; we'll see
#define u8c(ch) static_cast<char8_t>(ch)
// char8_t ch = u8c('x');
// very likely not a macro pre-C++20; can't skip utf-8 validity check on [2]?
std::u8string u8s (char8_t const* str); // [1], likely trivial
std::u8string u8s (char const* str); // [2], non-trivial
// C++20 and up
#define u8s(str) u8##str // or something; not sure
// end result:
// no, I can't even think how would one spell this:
std::u8string text = u8s("H€łlo Ẅørλd");
// this wouldn't work without refactoring u8string into a full specialization,
// to add the required constructor, but doing so is a PITA because
// the basic_string interface is YAIM (yet another infamous mess):
std::u8string text = u8"H€łlo Ẅørλd";
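For reference, a condensed version of the pseudocode above that should actually compile might look like the following. It is a sketch under several assumptions of mine: the type is renamed to a hypothetical `u8char` inside a hypothetical `compat` namespace (since `char8_t` is a keyword in C++20 and adding names to `std` is not allowed), storage is a plain `std::string` rather than `basic_string<unsigned char>`, and only the trivial conversion [1] is shown.

```cpp
#include <cassert>
#include <string>

namespace compat { // hypothetical namespace, not part of any library

// Distinct one-byte type that converts implicitly to and from unsigned char.
class u8char {
    unsigned char value;
public:
    constexpr u8char(unsigned char ch = 0x00) noexcept : value(ch) {}
    constexpr operator unsigned char() const noexcept { return value; }
};

static_assert(sizeof(u8char) == 1, "u8char should occupy exactly one byte");

// Public adapter, implemented through the conversion operator.
inline unsigned char to_char(u8char c) noexcept { return c; }

// Conversion [1]: copy a null-terminated u8char sequence into byte storage.
// No validity check is done here; the input is taken to be UTF-8 already.
inline std::string u8s(const u8char* str) {
    std::string out;
    for (; to_char(*str) != 0; ++str)
        out.push_back(static_cast<char>(to_char(*str)));
    return out;
}

} // namespace compat
```

Usage would be along the lines of `const compat::u8char euro[] = {0xE2, 0x82, 0xAC, 0x00};` followed by `std::string s = compat::u8s(euro);`. Conversion [2], from `char const*`, is where a validity check would have to go.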
I've tagged this as C++ in general, but this is more about (the value of) an implementation for Standards before C++20. More importantly, I'm not looking for "perfect" solutions or justifications; given the context, poor-man's is more than good enough.