10

From this answer I learned that in C++17 we can open a std::fstream using a UTF-8 path via std::filesystem::u8path. But in C++20 this function is deprecated, and we are supposed to pass a const char8_t* to the std::filesystem::path constructor instead.

Here comes the problem: although we can legally convert (via reinterpret_cast) any pointer to const char*, we can't do backwards: from const char* to e.g. const char8_t* (accessing the data through the result would break the strict aliasing rules). So if we have some external API returning a char-based UTF-8 representation of the filename (e.g. from a library written in C), we can't safely convert the pointer to a char8_t-based one.

So, how are we supposed to convert such a char-based view of UTF-8 strings to a char8_t-based view of them?

Ruslan
  • "we can't do backwards". Why would you want to? UTF-8 string data should go in `char8_t` strings to begin with. – n. m. could be an AI Aug 22 '19 at 06:33
  • 1
    @n.m. should doesn't mean it does. There's any number of libraries designed before 2020 which don't even know of `char8_t`. – Ruslan Aug 22 '19 at 06:36
  • Old libraries may continue to use old APIs. Deprecated doesn't mean removed. – n. m. could be an AI Aug 22 '19 at 06:42
  • I guess we should wait until the standard is finalized and compilers have implemented everything regarding `char8_t`. I'd bet the cast will be safe on MSVC since, to my knowledge, it doesn't take advantage of the strict aliasing rule (SAR), and gcc & clang will have some switch to treat `char8_t` as `char` to allow aliasing. UB is UB, but that is how it is with the SAR and legacy->new code bridges. – ixSci Aug 22 '19 at 06:55
  • if the `char[]` data is already encoded in UTF-8, then simply `memcpy()` or `std::copy()` it as-is into a `char8_t[]` buffer. – Remy Lebeau Aug 23 '19 at 00:17
  • @n.'pronouns'm. UTF-8 string data cannot go in `char8_t` because there are currently no libraries which can read data into `char8_t`; we don't even have a `std::u8fstream`. – LanYi Oct 15 '20 at 22:42

2 Answers

4

Disclaimer: I'm the author of the P0482 proposal that introduced char8_t and deprecated u8path.

Your observations are correct; it is not permissible to use reinterpret_cast to produce a char8_t pointer to a sequence of char objects. This is discussed further at https://stackoverflow.com/a/57453713/11634221.

Though std::filesystem::u8path has been deprecated in C++20, there are no plans for its imminent removal; you can continue to use it. Further, P1423 corrects an unintended consequence of the changes in P0482 and permits it to be called with ranges of both char and char8_t in C++20. As far as I'm aware, no implementors have annotated std::filesystem::u8path as deprecated (I don't know if any plan to do so).

There is no (well-formed) way to produce a char8_t pointer-based view of a sequence of char. It is possible to write a range/iterator adapter that, internally, converts the individual char values to char8_t on iterator dereference. Such an adapter could satisfy the C++17 and C++20 random access iterator requirements for a non-mutable iterator (it can't satisfy the requirements for a mutable iterator because the dereference operation wouldn't be able to provide an lvalue, nor could it satisfy the requirements for a contiguous iterator). Such an adapter would suffice for calls to the std::filesystem::path constructors that accept ranges. Hmm, this might be a useful enough adapter to add to https://github.com/tahonermann/char8_t-remediation.

An alternative to a view over the underlying char data is, of course, to copy it, but I can appreciate why doing so might be considered undesirable (we already tend to do a lot of copying when working with std::filesystem::path).

Tom Honermann
  • 3
    `std::u8string_view` SHOULD have been this adapter. It is immutable and its original intent was insulating users from caring about the underlying string storage. Why can't we have nice things :( ? – Giovanni Funchal Oct 25 '19 at 09:56
  • Making `std::u8string_view` that adapter would have required use of `reinterpret_cast` (or similar) within its implementation. That would have prevented `std::u8string_view` from being `constexpr`. Additionally, even if we adopted some kind of compiler magic to make that work, it would have been an odd one-off unless we also extended that magic to `std::span` and other future types like `std::text_view`. – Tom Honermann Oct 26 '19 at 13:43
1

From this character types reference about char8_t:

It has the same size, signedness, and alignment as unsigned char (and, therefore, the same size and alignment as char and signed char), but is a distinct type.

Because it's a distinct type, you cannot convert from const char* to const char8_t* without breaking strict aliasing. But for all practical purposes, since char8_t is basically an unsigned char, you can use reinterpret_cast to convert the pointer. It's wrong, but it will work in practice.

For proper correctness either use char8_t to begin with, or copy the original characters into a char8_t buffer (or std::u8string).

Some programmer dude
  • Size, signedness and alignment are not what lets us defy strict aliasing rules. – Ruslan Aug 22 '19 at 06:37
  • 1
    Since it is a distinct type you can't access the cast pointer because it violates the strict aliasing rule. – ixSci Aug 22 '19 at 06:37