
Let's consider the following code listing the directory contents of the path given as the first argument to the program:

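#include <filesystem>
#include <iostream>

int main(int argc, char **argv)
{
    // Bail out early: without an argument, argv[1] is a null pointer.
    if(argc != 2)
    {
        std::cerr << "Please specify a directory.\n";
        return 1;
    }

    // Print every entry of the directory named by the first argument.
    for(const auto& p: std::filesystem::directory_iterator(argv[1]))
        std::cout << p << '\n';
}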

At first sight, this seems very lean, portable, and conforming to the C++ standard (please ignore that it does not catch the exception thrown if the directory does not exist).

However, there seem to be a few pitfalls. In particular, the C++ standard does not seem to mandate that the encoding of argv[1] matches that accepted by std::filesystem::path constructors nor does it seem to mandate that the encoding returned by std::filesystem::path::string() matches that accepted by std::cout.

Quite the opposite, the standard seems to introduce the new term "native encoding" which may be different from the execution character set encoding and is defined as:

The native encoding of a narrow character string is the operating system dependent current encoding for pathnames ([fs.class.path]).

From my reading of the standard, no conversion between encodings takes place if std::filesystem::path::value_type matches the char type of argv[1] (which is true on any POSIX system).
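Whether the "no conversion" case applies on a given implementation can at least be checked at compile time. The following is a minimal sketch, assuming C++17; the function names `path_takes_char_unconverted` and `describe_path_conversion` are hypothetical and only serve the illustration:

```cpp
#include <filesystem>
#include <type_traits>

namespace fs = std::filesystem;

// True when fs::path stores plain char, i.e. when a char* argument such as
// argv[1] is taken over as-is, without any encoding conversion.
constexpr bool path_takes_char_unconverted()
{
    return std::is_same_v<fs::path::value_type, char>;
}

// Hypothetical helper: a human-readable summary for diagnostics.
const char* describe_path_conversion()
{
    return path_takes_char_unconverted()
        ? "char input is used unconverted (typical for POSIX)"
        : "char input is converted to a wider native type (typical for Windows)";
}
```

Note that this only tells you whether a conversion step exists. It does not tell you which encoding the filesystem library assumes for the bytes, which is exactly the gap described here.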

This seems to allow, for example, a conforming implementation in which the execution character set encoding (and hence the encoding of argv[1] and that accepted by std::cout) is EBCDIC, but the encoding of strings accepted and provided by the filesystem library is ISO 8859-1, with no conversion performed between the two, making the filesystem library essentially useless. Worse yet, there is no way to figure out if the two encodings are the same or not.

This can even become dangerous if you write utilities that delete files: the name passed in argv[1] may refer to a completely different file once it is interpreted in the native encoding of the filesystem library.

Note that I'm not concerned about filesystems using different encodings than those used by programs. My concern is that the standard does not seem to mandate any conversion of those encodings.

The u8path() and u8string() functions are of no use here either because the standard also provides no way to convert between UTF-8 and the execution character set encoding (used by argv[1] and std::cout).
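For reference, this is roughly what the UTF-8 route looks like. It is a sketch under the assumption that the caller already knows the bytes are UTF-8, which is precisely what cannot be known for argv[1]. The helper names `path_from_utf8`, `utf8_from_path` and `survives_utf8_round_trip` are hypothetical; the preprocessor branch is there because `u8path()` is deprecated since C++20 and `u8string()` changed its return type in C++20:

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Build a path from bytes that the caller knows to be UTF-8.
fs::path path_from_utf8(const std::string& utf8)
{
#if defined(__cpp_lib_char8_t)
    // C++20: a char8_t source is defined to be UTF-8 encoded.
    return fs::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path() does the same job (deprecated since C++20).
    return fs::u8path(utf8);
#endif
}

// Get the path back as UTF-8 bytes in a plain std::string.
std::string utf8_from_path(const fs::path& p)
{
    // u8string() returns std::string in C++17 and std::u8string in C++20;
    // copying element-wise works with either.
    const auto u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
}

// True if a UTF-8 name comes back unchanged after passing through fs::path.
bool survives_utf8_round_trip(const std::string& utf8)
{
    return utf8_from_path(path_from_utf8(utf8)) == utf8;
}

// Usage sketch:
//   assert(survives_utf8_round_trip("r\xC3\xA9sum\xC3\xA9.txt"));
```

The round trip itself is well defined. What is missing is the step before it (argv[1] to UTF-8) and the step after it (UTF-8 to whatever std::cout expects).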

Is there any portable, encoding-agnostic, and standard-compliant way to do this?

Contter
  • Speaking of standards, `ls` will show you the current working directory if given no arguments; it won't give you flak for not specifying it. Also, if you're working with EBCDIC and C++ together, I'm impressed. – tadman Nov 15 '18 at 16:40
  • Yes, there is no portable way to write an `ls` application in C++. Moreover, my experience tells me that there is no portable way to write any complex application in C++: you will always have to rely on things that are not specified by the C++ standard, either directly or hidden inside third-party libraries like Boost. In my opinion, this greatly contrasts C++ with languages like Java. – SergeyA Nov 15 '18 at 16:41
  • @SergeyA Yeah, every operating system is free to make up their own rules, and they often do for reasons we'll never be able to properly explain. – tadman Nov 15 '18 at 16:43
  • The root problem is that WG21 doesn't want to rely on POSIX here. Without that, the whole notion of a file name becomes non-portable. Now this can be reasonable; on tiny embedded systems files might be identified by merely a number. – MSalters Nov 15 '18 at 17:09
  • @MSalters I understand the reason but these systems could still exist if the standard provided a way to reliably set and get that number in the execution character set encoding. – Contter Nov 15 '18 at 17:20
  • @Contter: That's a bit in reverse. You expect the implementation to adapt to the Standard, not the other way around. But you can see the implementation problem here. Should file 1 be named "1" or "\x01" ? – MSalters Nov 15 '18 at 17:26
  • @SergeyA it is possible to extend portability if you're willing to spam OS-related "#ifdef"s – Barnack Nov 15 '18 at 17:32
  • @MSalters My point was that you don't have to rely on POSIX in order for the C++ standard to mandate something like "any strings accepted by and returned from filesystem::path functions are in the execution character set encoding". If that encoding is UTF-8, then "\x01" would be U+0001 which is a control character and not a number, and since the file is not identified by a control character but by a number u8"1" would be the correct filename. – Contter Nov 15 '18 at 17:39
  • @Contter: That's entirely reasonable, but the problem is that both `"1"` and `'\x01'` can be in the execution character set encoding at the same time. You can't assume `\x01` is a control character, ASCII is not mandatory. For all you know, `\x01` is `A`. And you can't tell what the OS will put in with `argv`, that's up to the OS and outside the scope of C++ – MSalters Nov 15 '18 at 17:44
  • @MSalters: I'm aware that the execution character set encoding does not need to be ASCII. It was just an example. But if the standard mandated what I proposed, you could reliably compare paths with string literals, and it wouldn't matter whether the encoding is ASCII, UTF-8 or EBCDIC, because the standard would guarantee that whatever you compare it to is in the same encoding. As of right now, you can't. There is another reason why you couldn't use the "\x01" way of encoding a number: if you encoded the file with the number 0xFF00FF that way, the 0x00 byte in the middle would be interpreted as the end of the string. – Contter Nov 15 '18 at 17:54
  • @MSalters: As for argv, the standard guarantees that it points to a valid multi-byte character sequence. An implementation that doesn't enforce this is non-conforming. – Contter Nov 15 '18 at 17:55
  • @Contter: True, but an implementation can satisfy that criterion by **always** pointing to `"", nullptr`. Makes your `ls` pretty pointless. And before you dismiss that idea, that IS what happens on Windows when you double-click on a program. – MSalters Nov 15 '18 at 17:59
  • @MSalters: From my understanding that would be non-conforming. The standard states that `argc shall be the number of arguments passed to the program from the environment in which the program is run. If argc is nonzero these arguments shall be supplied in argv[0] through argv[argc-1] as pointers to the initial characters of null-terminated multibyte strings (ntmbss)`. That you *can* run a program without arguments on Windows or any other operating system is of course correct, but if arguments are given, no conforming implementation can always pass `"", nullptr`. – Contter Nov 15 '18 at 18:08

1 Answer


No, and this is not just theoretical.

On Windows systems, paths are UTF-16 and path::value_type is wchar_t, not the char you get from char** argv. This isn't a problem by itself, since a path can be created from a char*. However, not every Windows file name can be expressed as a char* string. Hence the program is unable to list the contents of directories whose names cannot be expressed that way.
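One way to limit the damage is to keep names as path objects for as long as possible and never force them through a narrow string. The following is a sketch only, not a portable fix; `list_directory` is a hypothetical helper, and it reports errors through a `std::error_code` rather than exceptions:

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Collects the entries of `dir` as fs::path objects, so the names stay in
// the native encoding and are never converted to a narrow char string.
// On failure, `ec` is set and the entries gathered so far are returned.
std::vector<fs::path> list_directory(const fs::path& dir, std::error_code& ec)
{
    std::vector<fs::path> entries;
    fs::directory_iterator it(dir, ec);   // sets ec if dir cannot be opened
    const fs::directory_iterator end;
    while (!ec && it != end) {
        entries.push_back(it->path());
        it.increment(ec);                 // non-throwing advance
    }
    return entries;
}

// Usage sketch:
//   std::error_code ec;
//   auto entries = list_directory(".", ec);
//   assert(!ec);
```

This does not solve the input side. On Windows, the wide command line would have to come from a non-standard entry point such as Microsoft's `wmain`; with a `char**` argv, the name may already be lossy before `list_directory` is ever called.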

Now you'd think that Linux would be better. That's actually not entirely the case - the bytes you get for a filename can depend on whether you entered them on a keyboard or via TAB completion!
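The answer does not say why the bytes can differ. One plausible cause, assumed here for illustration, is Unicode normalization: the same visible name can be stored precomposed or decomposed, and a typed name and a completed name need not use the same form. A small sketch with the hypothetical helpers `hex_bytes` and `same_path`:

```cpp
#include <cassert>
#include <cstdio>
#include <filesystem>
#include <string>

// Renders each byte of `s` as two lowercase hex digits, space separated.
std::string hex_bytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// The same visible name, "café", in two UTF-8 spellings:
// precomposed U+00E9 versus "e" followed by combining U+0301.
const std::string precomposed = "caf\xC3\xA9";
const std::string decomposed  = "cafe\xCC\x81";

// Compares two names the way the filesystem library does.
bool same_path(const std::string& a, const std::string& b)
{
    return std::filesystem::path(a) == std::filesystem::path(b);
}

// Usage sketch:
//   assert(hex_bytes(precomposed) == "63 61 66 c3 a9");
//   assert(hex_bytes(decomposed)  == "63 61 66 65 cc 81");
```

On a POSIX system, where the bytes are used unconverted, `same_path(precomposed, decomposed)` is expected to be false, so the two spellings name two different files even though they look identical on screen.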

MSalters