So I am trying to check whether a given file exists. Following this answer I tried GetFileAttributesW. It works just fine for any ASCII input, but it fails for ß, ü and á (and, I suspect, any other non-ASCII character). I get ERROR_FILE_NOT_FOUND for filenames containing them and ERROR_PATH_NOT_FOUND for pathnames containing them, as one would expect if they didn't exist.

I made 100% sure that they did. I spent 15 minutes copying filenames to avoid typos and using literals to avoid any bad input. I couldn't find any mistake.

Since all of these characters are non-ASCII, I stopped trying, because I suspect I have screwed up an encoding somewhere. I just can't spot it. Is there something I am missing? I link against Kernel32.lib.

Thanks!

#include <stdio.h>
#include <iostream>
#include <string>
#include "Windows.h"


int main(){
    while(true){
        std::wstring file_path;
        std::getline(std::wcin, file_path);

        DWORD dwAttrib = GetFileAttributesW(file_path.data());
        if(dwAttrib == INVALID_FILE_ATTRIBUTES){
            printf("error: %lu\n", GetLastError());
            continue;
        }

        if(!(dwAttrib & FILE_ATTRIBUTE_DIRECTORY))
            printf("valid!\n");
        else
            printf("invalid!\n");
    }
}
pulp_user
  • *using literals to avoid any bad input.* -- Does that mean you typed in the character in the source code? I don't think that's a good idea, as you don't know what the compiler did to that character literal. How about first writing a small `FindFirstFile / NextFile` program and see what you get back? Then take the file name that is returned and call `GetFileAttributes` on that name. – PaulMcKenzie Oct 26 '17 at 19:42
  • `stdio.h` and friends are legacy C compatibility headers - use `cstdio` and friends instead. – tambre Oct 26 '17 at 19:42
  • Try it in this way (with constant Unicode string) `DWORD dwAttrib = GetFileAttributesW( L"c:\\dir\\your_ß_file" );`. If it works, the problem is with `wstring` conversions or `getline`. – i486 Oct 26 '17 at 19:51
  • @PaulMcKenzie "don't know what the compiler did to that character literal": Of course, you know. You use your chosen, specific character encoding for the source code, tell that to the compiler and also tell the compiler which encoding to transform it to. (See this [answer](https://stackoverflow.com/a/12217048/2226988).) This isn't an option even if you use the compiler's defaults. – Tom Blodget Oct 26 '17 at 20:13
  • @PaulMcKenzie I got the filename in question by using your suggested `FindFirstFile / NextFile` method. The character that should be "ß" is "▀" in the output (I hope that it can be displayed on a website). This doesn't look like a generic "I don't know", but a specific Unicode character. If I use this filename as the input to my program, everything works. I still don't know why this is a different character than expected, though. – pulp_user Oct 26 '17 at 20:26
  • @tambre You are right of course, but since I only do debug prints with those, I don't think it matters for my problem. The only I/O really involved is `std::getline`, which should come from `<string>`. – pulp_user Oct 26 '17 at 20:28
  • @pulp_user -- Check the value of the character instead of how it is displayed. Does what the value of "ß" represents make sense? – PaulMcKenzie Oct 26 '17 at 20:30
  • Some more: Instead of "ü" i get "³" "ö" -> "÷" and "á" -> "ß" (ironically). Maybe byteorder is somehow screwed with? I guess I'll look at binary next. – pulp_user Oct 26 '17 at 20:32
  • @PaulMcKenzie I looked at the values; they were nonsensical, but I noticed something else. I have the code open in my editor, which uses UTF-8, and in VS, which apparently doesn't, since the characters in question are displayed differently in the two programs. If I type my filename as a literal in VS and pass it, everything works. So the compiler seems to ignore the fact that there are UTF-8 characters in my editor and just interprets them as wchar? Does VS use a wchar encoding by default? I still don't know why typing the name into the console fails, though. But the issue looks similar. – pulp_user Oct 26 '17 at 21:08
  • There are so many conversion problems going on in this dumpster fire that debugging it is going to be very difficult. Unicode support in console i/o on Windows is extremely limited (probably the original problem and additional confusion when printf-debugging). The Microsoft compilers might not recognize UTF-8 source files as UTF-8 unless they have a UTF-8-encoded BOM (probably confounded the attempts to hard-code the file name in the application). – Adrian McCarthy Oct 26 '17 at 22:23

1 Answer

It's extremely hard to make Unicode work well in a console program on Windows, so let's start by removing that aspect of it (for now).

Modify your program so that it looks like this:

#include <cstdio>
#include <iostream>
#include <string>
#include "Windows.h"

int main() {
    std::wstring file_path = L"fooß.txt";

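    DWORD dwAttrib = GetFileAttributesW(file_path.c_str());
    if (dwAttrib == INVALID_FILE_ATTRIBUTES) {
        // Report the failure and stop; the attribute bits are meaningless here.
        printf("error: %lu\n", GetLastError());
        return 1;
    }

    // A path that exists and is not a directory counts as a valid file.
    if (!(dwAttrib & FILE_ATTRIBUTE_DIRECTORY))
        printf("valid!\n");
    else
        printf("invalid!\n");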

    return 0;
}

Make sure this file is saved with a byte-order mark (BOM), even if you're using UTF-8. Windows applications, including Visual Studio and the compilers, can be very picky about that. If your editor won't do that, open the file in Visual Studio, choose Save As, click the down arrow next to the Save button, and choose Save with Encoding. In the Advanced Save Options dialog, choose "Unicode (UTF-8 with signature) - Codepage 65001".

Make sure you have a file named fooß.txt in the current folder. I strongly recommend using a GUI program to create this file, like Notepad or Explorer.

This program works. If you still get a file-not-found message, check that the test file is in the working directory, or change the program to use an absolute path. If you use an absolute path, use backslashes and make sure they are all properly escaped. Also check for typos, the file extension, and so on.

Now, if you take the file name from standard input:

    std::wstring file_path;
    std::getline(std::wcin, file_path);

And you enter fooß.txt in the console window, you'll probably find that it doesn't work. And if you look in the debugger, you'll see that the character that should be ß is something else. For me, it's á, but it might be different for you if your console codepage is something else.

ß is U+00DF in Unicode. In Windows 1252 (the most common codepage for Windows users in the U.S.), it's 0xDF, so it might seem like there's no chance of a conversion problem. But the console windows (by default) use OEM code pages. In the U.S., the common OEM codepage is 437. So when I try to type ß in the console, that's actually encoded as 0xE1. Surprise! That's the same as the Unicode value for á. And if you manage to enter a character with the value 0xDF, you'll see that it corresponds to the block character (▀) you reported in the comments.

You would think (well, I would think) that asking for the input from std::wcin would do whatever conversion is necessary. But it doesn't, and there's probably some legacy backward compatibility reason for that. You could try to imbue the stream with the "proper" codepage, but that gets complicated, and I've never bothered trying to make it work. I've simply stopped trying to use anything other than ASCII on the console.

anatolyg
Adrian McCarthy
  • Thanks for this answer! I saved the file with a BOM and it worked. I discovered that I can get the same effect by passing the compiler the `-utf-8` flag, which does not require me to save every file again with a BOM, although that might be desirable nevertheless. Not having Unicode in the console is an unfortunate limitation, but it is not a big problem, so I will probably ignore it. My program will have a GUI at some point, so I can add Unicode support then. Thanks! – pulp_user Oct 27 '17 at 10:19