Consider this code, running on a Linux system (Compiler Explorer link):
#include <filesystem>
#include <cstdint>
#include <cstdio>

int main()
{
    try
    {
        const char8_t bad_path[] = {0xf0, u8'a', 0}; // invalid UTF-8: 0xf0 expects continuation bytes
        std::filesystem::path p(bad_path);
        for (auto c : p.u8string())
        {
            printf("%X ", static_cast<std::uint8_t>(c));
        }
    }
    catch (const std::exception& e)
    {
        printf("error: %s\n", e.what());
    }
}
It deliberately constructs a std::filesystem::path object from a string with invalid UTF-8 encoding (0xf0 starts a 4-byte sequence, but 'a' is not a continuation byte; more info here). When u8string is called, no exception is thrown. I find this surprising, as the documentation at cppreference states:
- The result encoding in the case of u8string() is always UTF-8.
Checking the implementation of LLVM's libc++, I see that indeed no validation is performed; the string held internally by std::filesystem::path is simply copied into a u8string and returned:
_LIBCPP_INLINE_VISIBILITY _VSTD::u8string u8string() const { return _VSTD::u8string(__pn_.begin(), __pn_.end()); }
The GCC implementation (libstdc++) exhibits the same behavior.
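For completeness, a caller can perform this check itself. Below is a minimal sketch (is_valid_utf8 is a hypothetical helper, not part of the standard library) that only verifies the lead/continuation byte structure and ignores overlong encodings and surrogate code points:

#include <cstddef>
#include <cstdio>
#include <filesystem>
#include <string>

// Hypothetical helper: returns true iff the byte sequence is structurally
// well-formed UTF-8 (each lead byte followed by the right number of
// 10xxxxxx continuation bytes). Overlong forms and surrogates are ignored.
static bool is_valid_utf8(const std::u8string& s)
{
    for (std::size_t i = 0; i < s.size(); )
    {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        const std::size_t len = (b < 0x80)        ? 1
                              : (b >> 5) == 0x6  ? 2  // 110xxxxx
                              : (b >> 4) == 0xE  ? 3  // 1110xxxx
                              : (b >> 3) == 0x1E ? 4  // 11110xxx
                              : 0;                    // stray continuation or invalid lead byte
        if (len == 0 || i + len > s.size())
            return false;
        for (std::size_t j = 1; j < len; ++j)
            if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                return false; // continuation byte must match 10xxxxxx
        i += len;
    }
    return true;
}

int main()
{
    const char8_t bad_path[] = {0xf0, u8'a', 0};
    std::filesystem::path p(bad_path);
    printf("valid: %d\n", is_valid_utf8(p.u8string())); // prints "valid: 0"
}

Run against the path from the example above, this reports the byte sequence as malformed, so any validity guarantee would have to be enforced by the caller rather than by u8string() itself.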
Of course this is a contrived example, as I deliberately construct a path from an invalid string to keep things simple. But to my knowledge, the Linux kernel and its filesystems do not enforce that file paths are valid UTF-8 (a path is just a sequence of bytes, excluding '/' and NUL), so I could encounter such a path "in the wild" while, e.g., iterating over a directory (see the sketch below).
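To make the "in the wild" scenario concrete, here is a minimal sketch (assuming a POSIX system and a writable current directory; the file name is just an illustration) that creates a file whose name is not valid UTF-8 and then reads those exact bytes back through u8string():

#include <filesystem>
#include <fstream>
#include <cstdio>

int main()
{
    namespace fs = std::filesystem;

    // On Linux a file name is an arbitrary byte string (anything except
    // '/' and NUL), so a name containing the lone lead byte 0xf0 is legal.
    const char raw_name[] = {static_cast<char>(0xf0), 'a', '\0'};
    std::ofstream(raw_name).put('x'); // create the file

    for (const fs::directory_entry& entry : fs::directory_iterator("."))
    {
        for (char8_t c : entry.path().filename().u8string())
            printf("%X ", static_cast<unsigned char>(c)); // prints F0 61 for our file
        printf("\n");
    }

    fs::remove(raw_name); // clean up
}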
Am I right to conclude that std::filesystem::path::u8string actually does not guarantee that a valid UTF-8 string will be returned, despite what the documentation says? If so, what is the motivation behind this design?