Consider this code, running on a Linux system (Compiler Explorer link):

#include <cstdint>     // uint8_t
#include <cstdio>      // printf
#include <exception>   // std::exception
#include <filesystem>

int main()
{
    try
    {
        const char8_t bad_path[] = {0xf0, u8'a', 0};  // invalid utf-8, 0xf0 expects continuation bytes
        std::filesystem::path p(bad_path);

        for (auto c : p.u8string())
        {
            printf("%X ", static_cast<uint8_t>(c));
        }
    }
    catch (const std::exception& e)
    {
        printf("error: %s\n", e.what());
    }
}

It deliberately constructs a std::filesystem::path object using a string with incorrect UTF-8 encoding (0xf0 starts a 4-byte character, but 'a' is not a continuation byte; more info here).

When u8string is called, no exception is thrown; I find this surprising as the documentation at cppreference states:

  1. The result encoding in the case of u8string() is always UTF-8.

Checking the implementation of LLVM's libcxx, I see that indeed, there's no validation performed - the string held internally by std::filesystem::path is just copied into a u8string and returned:

_LIBCPP_INLINE_VISIBILITY _VSTD::u8string u8string() const { return _VSTD::u8string(__pn_.begin(), __pn_.end()); }

The GCC implementation (libstdc++) exhibits the same behavior.

Of course this is a contrived example, as I deliberately construct a path from an invalid string to keep things simple. But to my knowledge, the Linux kernel/filesystems do not enforce that file paths are valid UTF-8 strings, so I could encounter a path like that "in the wild" while e.g. iterating a directory.

Am I right to conclude that std::filesystem::path::u8string actually does not guarantee that a valid UTF-8 string will be returned, despite what the documentation says? If so, what is the motivation behind this design?

user4520

  • The general philosophy of C++ is to not make you pay for anything you don't need. Most people won't need validation of file names. – Mark Ransom Apr 21 '22 at 19:03
  • @OP What if the programmer wishes to not pay for the overhead of checking the string? Maybe in "debug mode" the check is done, but should it be done in "release mode", i.e. an optimized version of the program? – PaulMcKenzie Apr 21 '22 at 19:03
  • @PaulSanders In this example, I pass it in, yes, but I could imagine a case where I'm iterating a directory (`std::directory_iterator`) that contains a file with such an invalid name. – user4520 Apr 21 '22 at 19:29
  • Yes, that's true. The other comments posted here are more insightful than mine. – Paul Sanders Apr 21 '22 at 19:35

1 Answer

The current C++ standard states in fs.path.type.cvt:

char8_t: The encoding is UTF-8. The method of conversion is unspecified.

and also

If the encoding being converted to has no representation for source characters, the resulting converted characters, if any, are unspecified.

So, in a nutshell, anything involving the actual interpretation of the bytes making up the path is unspecified, meaning that implementations are free to handle invalid data as they see fit. So yes, std::filesystem::path::u8string() does not really guarantee that a valid UTF-8 string is returned.

Regarding the motivation: the standard says nothing about it, but one can get an idea by looking at boost::filesystem, which the standard library's filesystem support is based on. Its documentation states:

When a class path function argument type matches the operating system's API argument type for paths, no conversion is performed rather than conversion to a specified encoding such as one of the Unicode encodings. This avoids unintended consequences, etc.

I guess you are using a POSIX system, in which case the underlying operating system API most likely uses UTF-8 or plain binary filenames. Hence, inputs are kept as is, so as not to run into any conversion issues. Windows, on the other hand, uses UTF-16 and therefore has to convert the string as soon as the path is constructed, which results in an exception when the input is invalid UTF-8 (godbolt).
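The "kept as is" behavior on POSIX can be observed directly. The following is a small sketch, assuming a POSIX implementation where std::filesystem::path::value_type is char (it will not compile on Windows, where native() returns a std::wstring). The function name round_trips_unchanged is made up for this illustration.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: returns true if a path built from `bytes` hands the
// very same bytes back through native(), i.e. the constructor neither
// converted nor validated them. Assumes a POSIX system (value_type == char).
bool round_trips_unchanged(const std::string& bytes)
{
    const std::filesystem::path p(bytes);
    return p.native() == bytes;
}
```

With the byte sequence from the question (0xf0 followed by 'a'), this is expected to return true on Linux with both libstdc++ and libc++, since a char-based source already matches the native type and needs no conversion.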

Sedenion