This is an extension of this question: fstream not opening files with accent marks in pathname

The problem is the following: a program cannot open a plain text file on an NTFS volume when the pathname contains accented characters (e.g. à, ò, ...). In my tests I'm using a file with pathname I:\università\foo.txt (università is the Italian word for university).

The following is the test program:

#include <iostream>
#include <fstream>
#include <string>
#include <cstdio>
#include <cstring>   // strerror
#include <cerrno>    // errno
#include <Windows.h>

using namespace std;

// String literals are const, so use the pointer-to-const typedefs.
LPCSTR cPath = "I:/università/foo.txt";
LPCWSTR widecPath = L"I:/università/foo.txt";
string path("I:/università/foo.txt");

void tryWithStandardC();
void tryWithStandardCpp();
void tryWithWin32();

int main(int argc, char **argv) {
    tryWithStandardC();
    tryWithStandardCpp();
    tryWithWin32();

    return 0;
} 

void tryWithStandardC() {
    FILE *stream = fopen(cPath, "r");

    if (stream) {
        cout << "File opened with fopen!" << endl;
        fclose(stream);
    }

    else {
        cout << "fopen() failed: " << strerror(errno) << endl;
    }
}

void tryWithStandardCpp() {
    ifstream s;
    s.exceptions(ifstream::failbit | ifstream::badbit | ifstream::eofbit);      

    try {
        s.open(path.c_str(), ifstream::in);
        cout << "File opened with c++ open()" << endl;
        s.close();
    }

    catch (const ifstream::failure &f) {
        cout << "Exception " << f.what() << endl;
    }   
}

void tryWithWin32() {

    DWORD error;
    // Call the narrow (ANSI) variant explicitly so this compiles whether or not UNICODE is defined.
    HANDLE h = CreateFileA(cPath, GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    if (h == INVALID_HANDLE_VALUE) {
        error = GetLastError();
        cout << "CreateFile failed: error number " << error << endl;
    }

    else {
        cout << "File opened with CreateFile!" << endl;
        CloseHandle(h);
        return;
    }

    HANDLE wideHandle = CreateFileW(widecPath, GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    if (wideHandle == INVALID_HANDLE_VALUE) {
        error = GetLastError();
        cout << "CreateFileW failed: error number " << error << endl;
    }

    else {
        cout << "File opened with CreateFileW!" << endl;
        CloseHandle(wideHandle);
    }
}

The source file is saved with UTF-8 encoding. I'm using Windows 8.

This is the output of the program compiled with VC++ (Visual Studio 2012):

fopen() failed: No such file or directory
Exception ios_base::failbit set
CreateFile failed: error number 3
CreateFileW failed: error number 3

This is the output using MinGW g++:

fopen() failed: No such file or directory
Exception basic_ios::clear
CreateFile failed: error number 3
File opened with CreateFileW!

So let's go to the questions:

  1. Why do fopen() and std::ifstream work in a similar test on Linux but not on Windows?
  2. Why does CreateFileW() work only when compiling with g++?
  3. Is there a cross-platform alternative to CreateFile?

I hope that opening a generic file with a generic pathname can be done without platform-specific code, but I have no idea how to do it.

Thanks in advance.

eang
  • I believe, but not 100% sure, that Windows doesn't allow UTF-8 in filenames - only Unicode-16 (meaning the first 65535 characters in Unicode). UTF-8 may well render correctly to the screen [under the right circumstances], but is not the right thing. It would appear that at least sometimes g++ does a better job of translating UTF to Unicode... – Mats Petersson Jan 25 '13 at 18:20
  • MS Windows is based on UTF-16, so it should be fully Unicode-compliant. – Ulrich Eckhardt Jan 25 '13 at 19:09
  • Windows APIs (OS level) all use UTF-16 natively. So ultimately, your filename is going to get to Windows OS in UTF-16. However, from your source code to the OS it is going to be handled by a few different layers. The only one that is likely to survive correctly is the L"name" variety, because otherwise your characters are indeed undefined behavior when used in a literal string (i.e. how the compiler encodes them is vendor specific, so what gets to the underlying OS layer depends on the compiler). – Mordachai Jan 25 '13 at 21:05

2 Answers

You write:

“The source file is saved with UTF-8 encoding.”

Well, that’s fine so far if you’re using the g++ compiler, which assumes UTF-8 as its default source encoding. However, Visual C++ by default assumes that the source file is encoded in Windows ANSI, unless it's clearly otherwise. So make very sure that the file has a BOM (Byte Order Mark) at the start, which (undocumented, as far as I know) causes Visual C++ to treat it as encoded with UTF-8.

You then ask,

“1. Why do fopen() and std::ifstream work in a similar test on Linux but not on Windows?”

For Linux it's likely to work because (1) modern Linux is UTF-8 oriented, so if the filename looks the same it is likely the same as the identical looking UTF-8 encoded filename in the source code, and (2) in *nix a filename is just a sequence of bytes, not a sequence of characters. Which means that regardless of how it looks, if you pass the identical sequence of bytes, the same values, then you have a match, otherwise not.

In contrast, in Windows a filename is a sequence of characters that can be encoded in various ways.
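The point about byte sequences can be illustrated with two spellings of the same visible name. This is a sketch with made-up names (kPrecomposed, kDecomposed, sameBytes): both strings render as "università", but one uses the precomposed à (U+00E0) and the other an 'a' followed by a combining grave accent (U+0300). On a filesystem that compares names byte by byte, they would name two different files.

```cpp
#include <string>

// "à" as the single precomposed code point U+00E0 (UTF-8 bytes C3 A0).
const std::string kPrecomposed = "universit\xC3\xA0";

// "à" as 'a' plus combining grave accent U+0300 (UTF-8 bytes CC 80).
const std::string kDecomposed = "universita\xCC\x80";

// Byte-wise comparison, which is how a byte-oriented filesystem
// decides whether two names are the same.
bool sameBytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

Here sameBytes(kPrecomposed, kDecomposed) is false even though both names look identical on screen.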

In your case the UTF-8 encoded filename in the source code is stored as Windows ANSI in the executable (and yes, the result of building with Visual C++ depends on the selected ANSI codepage in Windows, which also as far as I know is undocumented). Then this gobbledegook string is passed down a routine hierarchy and converted to UTF-16, which is the standard character encoding in Windows. The result doesn't match the filename at all.
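What such a mismatch looks like can be sketched in portable code. The example below is an illustration only, not what the compiler or the Windows API literally does: it decodes the two UTF-8 bytes of "à" once as UTF-8 and once as if each byte were a character of a single-byte code page. The function names are made up, and the single-byte decoder relies on Windows-1252 agreeing with Latin-1 for bytes 0xA0..0xFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Treat every byte as one character of a single-byte code page.
// For bytes 0xA0..0xFF the byte value equals the Unicode code point
// in both Latin-1 and Windows-1252.
std::vector<std::uint32_t> decodeAsSingleByte(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    for (unsigned char b : bytes) out.push_back(b);
    return out;
}

// Minimal UTF-8 decoder for 1- and 2-byte sequences only
// (U+0000..U+07FF), without validation. Anything else is skipped.
std::vector<std::uint32_t> decodeUtf8Short(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = bytes[i];
        if (b < 0x80) {
            out.push_back(b);
        } else if ((b & 0xE0) == 0xC0 && i + 1 < bytes.size()) {
            const unsigned char next = bytes[++i];
            out.push_back(((b & 0x1Fu) << 6) | (next & 0x3Fu));
        }
    }
    return out;
}
```

For the input bytes C3 A0, decodeUtf8Short yields the single code point U+00E0 (à), while decodeAsSingleByte yields U+00C3 and U+00A0 (Ã followed by a no-break space), which is a different name once it reaches the filesystem.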


You further ask,

“2. Why does CreateFileW() work only when compiling with g++?”

Presumably because you did not include a BOM at the start of the source code file (see above).

With a BOM everything works nicely with Visual C++, at least in Windows 7:

File opened with fopen!
File opened with c++ open()
File opened with CreateFile!

Finally, you ask,

“3. Is there a cross-platform alternative to CreateFile?”

Not really. There is Boost filesystem. But while its version 2 did have a workaround for the standard library's lossy narrow character based encoding, that workaround was removed in version 3. Version 3 just uses a Visual C++ extension of the standard library, where the Visual C++ implementation provides wide character argument versions of the stream constructors and open. I.e., at least as far as I know (I haven't checked lately if things have been fixed), Boost filesystem only works in general with Visual C++, not with e.g. g++ – although it works for filenames without troublesome characters.

The workaround that v2 had, was to try with conversion to Windows ANSI (the codepage specified by the GetACP function), and if that didn't work, try GetShortPathName, which is practically guaranteed to be representable with Windows ANSI.

Part of the reason that the workaround in Boost filesystem was removed was, as I understand it, that it's in principle possible for the user to turn off the Windows short name functionality at least in Windows Vista and earlier. However that's not a practical concern. It just means that there is an easy fix available (namely turn it back on) if the user experiences problems due to having wilfully lobotomized the system.

Cheers and hth. - Alf
  • Thank you very much. In Visual Studio I have changed the encoding from UTF-8 *without signature* to UTF-8 *with signature* (i.e. with a BOM) and now it works with all the functions. Unfortunately this is not true if I compile with g++, because only CreateFileW() succeeds in opening the file. – eang Jan 26 '13 at 11:07

The problem you're stumbling over is that the encoding of the path you pass to fstreams is implementation-specific. Further, the behaviour of your program is implementation-defined because it uses characters outside of C++'s basic character set in the code, i.e. the accented characters. The problem there is that there are many different encodings that can be used to represent these characters.

Now, there are solutions:

  • Firstly, there is an MSC extension to tell the compiler which encoding it should assume.
  • In order to get a path working with CreateFileW(), you can code the path like wchar_t const path[] = {'f', 0x20ac, '.', 't', 'x', 't', 0}; (note the terminating zero, which CreateFileW() requires). This is not very convenient, but in practice paths are usually read from files with some Unicode encoding or entered by the user rather than hard-coded.
  • Then, there is an extension in the implementation of the standard library that allows you to use wchar_t paths, there are both _wfopen() and fstream constructors.
  • Then, there is Boost, which has a filesystem and iostream library that is specifically made to be portable. I would definitely look at this.
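As a variation on hard-coding the wide path, the accented character can be written as a universal character name (\u00e0 for à). The source file then contains only ASCII, so the compiler's guess about the file encoding no longer matters for this literal. This is a sketch; kWidePath and widePathLength are names chosen here for illustration.

```cpp
#include <cstddef>
#include <cwchar>

// "I:/università/foo.txt" with à spelled as the universal character
// name \u00e0, so the literal is pure ASCII in the source file.
const wchar_t kWidePath[] = L"I:/universit\u00e0/foo.txt";

// Number of wide characters before the terminating zero.
std::size_t widePathLength() {
    return std::wcslen(kWidePath);
}
```

On Windows, kWidePath could be passed directly to CreateFileW() or _wfopen().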

Note that while the wchar_t paths are not portable, porting them to a new platform is usually not very complicated. A few #ifdefs and you're set.
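A minimal sketch of such an #ifdef wrapper follows. It assumes the program keeps all paths as UTF-8 internally; the helper name openUtf8 is hypothetical. On Windows it converts the path to UTF-16 with MultiByteToWideChar and calls _wfopen; elsewhere it passes the bytes straight to fopen. Some MinGW setups may hide _wfopen in strict -std= modes, so treat that part as an assumption to verify with your toolchain.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <wchar.h>
#include <windows.h>
#endif

// Hypothetical helper: open a file whose path is given as UTF-8.
// Returns NULL on failure, like fopen.
std::FILE* openUtf8(const std::string& utf8Path, const char* mode) {
#ifdef _WIN32
    // Ask for the required length (includes the terminating zero).
    const int len = MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, NULL, 0);
    if (len <= 0) return NULL;
    std::wstring widePath(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, &widePath[0], len);
    // The mode string is plain ASCII, so widening char by char is enough.
    const std::string narrowMode(mode);
    const std::wstring wideMode(narrowMode.begin(), narrowMode.end());
    return _wfopen(widePath.c_str(), wideMode.c_str());
#else
    // On Linux and similar systems the filename is just bytes.
    return std::fopen(utf8Path.c_str(), mode);
#endif
}
```

A call such as openUtf8("I:/universit\xC3\xA0/foo.txt", "r") spells the à as explicit UTF-8 bytes, which again keeps the result independent of the source file encoding.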

Ulrich Eckhardt
  • Sorry, your comment is not helpful. Unless it changed with C++11, the language defines which characters C++ source code may consist of, this is called the "basic source character set" and it is a subset of ASCII. If you use characters outside of this set, your code becomes nonportable, as demonstrated. As I read it, it is not "ill-formed" in the way that the standard uses this term but "implementation-defined". Was that possibly the nit you intended to pick or was there any other substance behind your claim? – Ulrich Eckhardt Jan 25 '13 at 20:35
  • the reason for the downvote rather than just a comment was that you're *wrong*. it's not just a nit. and you're trying to weasel out of it, which is pretty annoying. – Cheers and hth. - Alf Jan 25 '13 at 20:56
  • So why not add something substantial? Just saying "you're wrong" is not helpful. – Ulrich Eckhardt Jan 25 '13 at 20:57
  • the substance is that you're wrong. that your claim is incorrect. that it's not how you say it is. that it's wrong and should not be taken seriously by anybody. is that difficult to understand? if it isn't, what substance are you asking about? – Cheers and hth. - Alf Jan 25 '13 at 20:58
  • Yes, I'm pretty sure the OP did not understand in which subtle way it is wrong and I didn't either, I merely guessed that you were picking on the small standardese distinction between two evils. Further, in addition to not being helpful, your wording is impolite. – Ulrich Eckhardt Jan 25 '13 at 21:03
  • it's pretty deceitful to have the downvote comment, quoting the erroneous claim, removed. – Cheers and hth. - Alf Jan 25 '13 at 22:27
  • I just flagged it as offensive, wasn't even aware that it would vanish. Rationale: Calling things "bullshit" as you did without any further justification is not helpful. Even on explicit request to clarify, you didn't provide anything with substance. Now again, you call my actions "deceitful". However, this time I'll let it stand there, for everybody to see and make their own judgement on your behaviour. – Ulrich Eckhardt Jan 25 '13 at 22:51
  • making folks believe that i wronged you doesn't make your answer right, and doesn't right your lies about the comment not being substantive. it **quoted** the bullshit claim. this connection is too slow to load your answer for editing so i cannot quote it again, but anyone with faster connection can look it up. in short you were bullshitting then, and you're bullshitting now. – Cheers and hth. - Alf Jan 25 '13 at 22:55
  • Again, you're just attacking me personally. If you had just taken the time to explain, I wouldn't have minded the "bullshit" that came with it, because I could accept the technical statement at its core, learn from it, evolve with it, ignore the ugly packaging it came in. – Ulrich Eckhardt Jan 26 '13 at 00:39