I'm using Microsoft Visual C++ 16.1 (2019 Community) and am trying to write code that will be "proper" in C++20, which is expected to have a char8_t type that will be an unsigned char. I define a type like this:

using char8_t = unsigned char;

Code such as the following:

std::string data;
const char8_t* ptr = data.c_str();

does not compile, because the const char* returned by c_str() will not convert to a const unsigned char* without a reinterpret_cast. Is there something I can do to prepare for C++20 without having reinterpret_casts all over the place?

crsguy
    By the time `char8_t` comes along, there will also be a `std::u8string` specialization of `std::basic_string` for `char8_t`. Don't mix `std::string` and `std::u8string` together when it comes to handling UTF-8 strings. – Remy Lebeau Jun 25 '19 at 22:39
    "*which will be an unsigned char*" That's not how `char8_t` works. It's a distinct type, different from `unsigned char`, though it can explicitly and losslessly be converted to/from them. – Nicol Bolas Jun 25 '19 at 22:45
    Just `using char8_t = char;` – KamilCuk Jun 25 '19 at 22:53

2 Answers

P1423 (char8_t backward compatibility remediation) documents a number of approaches that can be used to remediate the backward compatibility impact due to the adoption of char8_t via P0482 (char8_t: A type for UTF-8 characters and strings).

Because char8_t is a non-aliasing type, it is undefined behavior to read char data through a char8_t pointer obtained via reinterpret_cast, as in reinterpret_cast<const char8_t*>(data.c_str()). However, because char and unsigned char are allowed to alias any type, it is permissible to use reinterpret_cast in the other direction, e.g., reinterpret_cast<const char*>(u8"text").
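
To make the permitted direction concrete, here is a minimal sketch. The helper name `as_char` is hypothetical (it does not come from P1423), and the `__cpp_char8_t` check is there only so the same code also builds where `char8_t` is not yet enabled.

```cpp
#include <string>

#if defined(__cpp_char8_t)
// Hypothetical helper: view UTF-8 code units as plain chars.
// Reading any object through a const char* is allowed by the aliasing rules.
inline const char* as_char(const char8_t* s) noexcept {
    return reinterpret_cast<const char*>(s);
}
#endif

// Before char8_t, u8 literals are already arrays of char, so no cast is needed.
inline const char* as_char(const char* s) noexcept {
    return s;
}

// Copy a null-terminated UTF-8 literal into a std::string, byte for byte.
template <typename CodeUnit>
std::string to_std_string(const CodeUnit* s) {
    return std::string(as_char(s));
}
```

With this in place, `to_std_string(u8"text")` yields a `std::string` holding the same bytes whether or not `char8_t` is a distinct type in the current language mode.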

None of the remediation approaches documented in P1423 are silver bullets. You'll need to evaluate what works best for your use cases. You might also appreciate the answers to the question "C++20 with u8, char8_t and std::string".

With regard to char8_t not being a UTF-8 character and u8string not being a UTF-8 string: that is correct, in that char8_t is a code unit type (not a code point type) and u8string does not enforce well-formed UTF-8 sequences. However, the intent is very much that these types only be used for UTF-8 data.

Tom Honermann
Thanks for the comments. The comments and further research have corrected a major misconception that prompted the original question. I now understand that a C++20 char8_t is not a UTF-8 character and a C++20 u8string is not a UTF-8 string. While they may be used in a "UTF-8 string" implementation, they are not one themselves.

Thus, it appears that the use of reinterpret_cast is unavoidable, but it can be hidden or isolated in a set of inline function overloads (or a set of function templates). Implementing a utf8string object (perhaps as a template) as a distinct type is necessary, if one is not already available somewhere.
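
As a rough sketch of what such isolated helpers might look like, the functions below copy code units instead of casting pointers, which sidesteps the aliasing question at the cost of an allocation. The names `utf8_string`, `to_utf8` and `from_utf8` are made up for illustration, and the fallback to `std::string` when `char8_t` is unavailable is an assumption about how one might bridge older language modes.

```cpp
#include <string>

#if defined(__cpp_char8_t)
// Same type as std::u8string when char8_t is available.
using utf8_string = std::basic_string<char8_t>;
#else
// Assumption for older language modes: UTF-8 code units live in plain char.
using utf8_string = std::string;
#endif

// Copy each char into a UTF-8 code unit. No pointer is reinterpreted.
// Note: this does not validate that the bytes are well-formed UTF-8.
inline utf8_string to_utf8(const std::string& s) {
    return utf8_string(s.begin(), s.end());
}

// Copy each UTF-8 code unit back into a char.
inline std::string from_utf8(const utf8_string& s) {
    return std::string(s.begin(), s.end());
}
```

All conversions then go through these two functions, so the rest of the code base never needs a cast of its own.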

crsguy