10

I'm reading gzip-compressed files using zlib. With zlib, you open a file using

gzFile gzopen(const char *filepath, const char *mode);

How do you handle Unicode file paths that are stored as const wchar_t* on Windows?

On UNIX-like platforms you can just convert the file path to UTF-8 and call gzopen(), but that will not work on Windows.

Johan Råde
  • Not sure, but I'd expect it to accept UTF8, so that you can convert your UTF16 into UTF8 and pass the result as `char*`. – sharptooth Mar 15 '12 at 09:39
  • Have you tried using [wcstombs](http://www.cplusplus.com/reference/clibrary/cstdlib/wcstombs/) or [iconv](http://www.gnu.org/software/libiconv/) ? – Appleman1234 Mar 15 '12 at 12:55
  • @Appleman: On Windows wcstombs will, at least by default, convert the string to Windows-1252. Characters that can not be represented as Windows-1252 will be replaced by various substitution characters. If that happens the converted string can not be used as a file path. – Johan Råde Mar 15 '12 at 13:21
  • That's what the console SVN client on Windows does, apparently. And that makes working with Unicode filenames really painful ;) – Joey Mar 15 '12 at 16:30
  • Hard to see this issue, this is a problem for the guy that *created* the gzip file. File names are encoded in ISO 8859-1. Or whatever the app used that created the file, a common problem. – Hans Passant Mar 15 '12 at 16:52
  • @Hans Passant: I'm writing a library whose interface takes a file path as a boost::filesystem::path and whose implementation may read the file using the ZLib library. Then this is an issue. – Johan Råde Mar 15 '12 at 17:01
  • So just convert the string from wide to narrow. ICU for example. – Hans Passant Mar 15 '12 at 17:10
  • @Hans Passant: Narrow with what encoding? The default encoding on windows for narrow strings is Windows-1252 and that will not work. It can not handle most code points above 0xff. – Johan Råde Mar 15 '12 at 17:45
  • Quote: "file names are encoded in ISO 8859-1". – Hans Passant Mar 15 '12 at 17:46
  • @Hans Passant: I read that, but I did not understand what you mean. I can create files on my Windows computer with names such as "黒死.txt". And I can open that file by passing its name (as a UTF-16 encoded wide string) to _wfopen(...) – Johan Råde Mar 15 '12 at 17:52
  • @Hans Passant: How would I encode "黒死.txt" with ISO 8859-1? Obviously I'm missing some piece of information. Please enlighten me. – Johan Råde Mar 15 '12 at 17:57
  • Again, this is a problem for the guy that *creates* the gzip. You may readily assume that gzip isn't a very popular format in East Asia. – Hans Passant Mar 15 '12 at 18:00
  • The issue is that some library between your code and the OS needs a `char *` (at least that's my problem and why I'm here today). So there must be a way to do `wchar*` → [lib space] `char*` → `fopen` → [OS space] `_wfopen`, with the final `_wfopen` receiving a reconstruction of the original string. So the question is: what is the inverse function of dan04's `ToUTF16`? Is it `wcstombs`? Along the chain between my code and OS space, there is no need to interpret the string as glyphs, so non-encodable characters could be kept as multibyte sequences and reconstructed by `ToUTF16`. – v.oddou May 18 '17 at 02:39

5 Answers

15

The next version of zlib will include this function where _WIN32 is #defined:

gzFile gzopen_w(const wchar_t *path, const char *mode);

It works exactly like gzopen(), except it uses _wopen() instead of open().

I purposely did not duplicate the second argument of _wfopen(), and as a result I did not call it _wgzopen() to avoid possible confusion with that function's arguments. Hence the name gzopen_w(). That also avoids the use of the C-reserved name space.

Mark Adler
  • Best kind of answer. You ask how to do it and author of the library comes with a new feature. – Erkin Alp Güney May 16 '18 at 11:50
  • @ErkinAlpGüney I disagree, a better option would be using the issue tracker. BTW: This *next version* was **1.2.7 (2 May 2012)** (see [first commit on that](https://github.com/madler/zlib/commit/dbe0bed739c26a2c36319794108cb87ad77c5469#diff-e49fc09e36fee921723dfe1fe9f7c4c4d5017b2368b585c826068111f5347955), there is no corresponding entry in the issue tracker) – Wolf Aug 11 '21 at 10:20
12

First of all, what is a filename?

On Unix-like systems

A filename is a sequence of bytes terminated by zero. The kernel doesn't need to care about character encoding (except to know the ASCII code for /).

However, it's more convenient from the users' point of view to interpret filenames as sequences of characters, and this is done by a character encoding specified as part of the locale. Unicode is supported by making UTF-8 locales available.

In C programs, files are represented with ordinary char* strings in functions like fopen. There is no wide-character version of the POSIX API. If you have a wchar_t* filename, you must explicitly convert it to a char*.
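
As a rough sketch of that explicit conversion, the function below encodes a wide string as UTF-8 by hand. The name wide_to_utf8 is made up for this illustration, and the code assumes that each wchar_t holds one full Unicode code point (true where wchar_t is 32 bits, as with glibc, but not on Windows). In a UTF-8 locale, wcstombs should give the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: encode `in` as UTF-8 into `out` (capacity `cap` bytes,
   terminator included). Assumes one Unicode code point per wchar_t.
   Returns the number of bytes written, not counting the terminator,
   or (size_t)-1 on invalid input or insufficient space. */
size_t wide_to_utf8(const wchar_t *in, char *out, size_t cap)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (cap == 0)
        return (size_t)-1;

    for (; *in != L'\0'; ++in) {
        unsigned long c = (unsigned long)*in;
        unsigned char buf[4];
        size_t len, i;

        if (c < 0x80) {                       /* 1 byte: ASCII */
            buf[0] = (unsigned char)c;
            len = 1;
        } else if (c < 0x800) {               /* 2 bytes */
            buf[0] = (unsigned char)(0xC0 | (c >> 6));
            buf[1] = (unsigned char)(0x80 | (c & 0x3F));
            len = 2;
        } else if (c < 0x10000) {             /* 3 bytes */
            if (c >= 0xD800 && c <= 0xDFFF)
                return (size_t)-1;            /* surrogates are not code points */
            buf[0] = (unsigned char)(0xE0 | (c >> 12));
            buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (c & 0x3F));
            len = 3;
        } else if (c <= 0x10FFFF) {           /* 4 bytes */
            buf[0] = (unsigned char)(0xF0 | (c >> 18));
            buf[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (c & 0x3F));
            len = 4;
        } else {
            return (size_t)-1;                /* outside the Unicode range */
        }

        if (n + len >= cap)                   /* keep room for the terminator */
            return (size_t)-1;
        for (i = 0; i < len; ++i)
            out[n++] = (char)buf[i];
    }
    out[n] = '\0';
    return n;
}
```

The resulting char* string can then be passed to fopen or gzopen on a system whose filenames are interpreted as UTF-8.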

On Windows NT

A filename is a sequence of UTF-16 code units. In fact, all string manipulation in Windows is done in UTF-16 internally.

All of Microsoft's C(++) libraries, including the Visual C++ runtime library, use the convention that char* strings are in the locale-specific legacy "ANSI" code page, and wchar_t* strings are in UTF-16. And the char* functions are just backwards-compatibility wrappers around the new wchar_t* functions.

So, if you call MessageBoxA(hwnd, text, caption, type), that's essentially the same as calling MessageBoxW(hwnd, ToUTF16(text), ToUTF16(caption), type). And when you call fopen(filename, mode), that's like _wfopen(ToUTF16(filename), ToUTF16(mode)).

Note that _wfopen is one of many non-standard C functions for working with wchar_t* strings. And this isn't just for convenience; you can't use the standard char* equivalents because they limit you to the "ANSI" code page (which can't be UTF-8). For example, in a windows-1252 locale, you can't (easily) fopen the file שלום.c, because there's just no way to represent those characters in a narrow string.
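
One way to find out whether a given wide path falls into this trap is to convert it to a narrow string and back, and see whether it survives. The following is only a sketch: the function name and the PATH_BUF limit are invented for the example, and the result depends entirely on the C locale that is active when it runs.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

#define PATH_BUF 1024   /* arbitrary limit chosen for this sketch */

/* Hypothetical check: returns 1 if `wide` survives a wide -> narrow -> wide
   round trip in the current C locale, 0 if characters are lost, replaced,
   or the path is too long for the buffers. */
int survives_narrow_round_trip(const wchar_t *wide)
{
    char narrow[PATH_BUF];
    wchar_t back[PATH_BUF];
    size_t n;

    assert(wide != NULL);

    n = wcstombs(narrow, wide, sizeof narrow);
    if (n == (size_t)-1 || n >= sizeof narrow)
        return 0;       /* not representable, or no room for the terminator */

    n = mbstowcs(back, narrow, PATH_BUF);
    if (n == (size_t)-1 || n >= PATH_BUF)
        return 0;

    /* Substitution characters such as '?' make the comparison fail. */
    return wcscmp(wide, back) == 0;
}
```

If the check returns 0, the narrow string should not be handed to fopen or gzopen; a wide-character function such as _wfopen is needed instead.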

In cross-platform libraries

Some typical approaches are:

  1. Use Standard C functions with char* strings, and just don't care about support for non-ANSI characters on Windows.
  2. Use char* strings but interpret them as UTF-8 instead of ANSI. On Windows, write wrapper functions that take UTF-8 arguments, convert them to UTF-16, and call functions like _wfopen.
  3. Use wide character strings everywhere, which is like #2 except that you need to write wrapper functions for non-Windows systems.
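
Approach #2 might look roughly like the sketch below. The names fopen_utf8 and utf8_to_utf16 are invented for this example; MultiByteToWideChar and _wfopen are the Windows functions doing the actual work, and on other systems the UTF-8 bytes are simply passed through to fopen.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>

/* Hypothetical helper: convert a UTF-8 string to a newly allocated UTF-16
   string. Returns NULL on invalid input or allocation failure. */
static wchar_t *utf8_to_utf16(const char *s)
{
    /* First call: ask for the required length, terminator included. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    wchar_t *w;

    if (n <= 0)
        return NULL;
    w = (wchar_t *)malloc((size_t)n * sizeof *w);
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: like fopen(), but `path` is UTF-8 on every platform. */
FILE *fopen_utf8(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);
#ifdef _WIN32
    {
        FILE *f = NULL;
        wchar_t *wpath = utf8_to_utf16(path);
        wchar_t *wmode = utf8_to_utf16(mode);

        if (wpath != NULL && wmode != NULL)
            f = _wfopen(wpath, wmode);
        free(wpath);
        free(wmode);
        return f;
    }
#else
    /* Unix-like kernels treat the path as bytes, so no conversion is needed. */
    return fopen(path, mode);
#endif
}
```

Callers then keep every path as a UTF-8 char* string and never touch the "ANSI" code page.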

How does zlib handle filenames?

Unfortunately, it appears to use the naïve approach #1 above, with open (rather than _wopen) used directly.

How can you work around it?

Besides the solutions already mentioned (my favorite of which is Appleman1234's gzdopen suggestion), you could take advantage of symbolic links to give the file an alternative all-ASCII name which you could then safely pass to gzopen. You might not even have to do that if the file already has a suitable short name.

dan04
  • So there are two approaches for zlib. Either a) have gzopen() always do the UTF-8 to UTF-16 conversion on Windows and use _wopen(), or b) leave gzopen() as is using open(), and add a new _wgzopen() function only on Windows that takes a UTF-16 argument and uses _wopen(). What would be dan04's recommendation? – Mark Adler Mar 16 '12 at 16:05
  • @Mark: You should not ask whether the gzip library should use this or that encoding. The library should not decide which encoding to use. That is the responsibility of the app that uses the library. The app usually does that by setting the current locale. The library should simply use the encodings specified by the current locale. The easiest way to do that, is to delegate to existing locale aware functions, as in your alternative b). – Johan Råde Mar 16 '12 at 17:33
  • @MarkAdler: For *me*, it would be more convenient if zlib used UTF-8, as this is what my team's coding standard requires (mainly for reasons of compatibility with other third-party libraries such as SQLite and TinyXML). Perhaps you could provide both UTF-8 and UTF-16 versions of the functions. – dan04 Mar 16 '12 at 19:15
  • Ok. So I could do both. gzopen() could convert from UTF-8 to UTF-16 and call _wopen when compiled in Windows. And there could also be a _wgzopen() that uses UTF-16 for input (for both arguments?). I don't get the whole "delegate to existing locale aware functions" thing. Does that mean that the routine that converts from UTF-8 to UTF-16 is not "locale aware"? By the way, what is that routine? – Mark Adler Mar 16 '12 at 20:33
  • @MarkAdler: That routine is `MultiByteToWideChar()` or `iconv()`. – dan04 Mar 16 '12 at 21:37
4

You have the following options

 #ifdef _WIN32 

 #define F_OPEN(name, mode) _wfopen((name), (mode))

 #endif    
  1. Patch zlib so that it uses _wfopen on Windows rather than fopen, using something similar to the above in zutil.h.

  2. Use _wfopen or _wopen instead of gzopen, and pass the return value to gzdopen.

  3. Use libiconv or some other library to convert the file path from your given Unicode encoding to ASCII, and pass the ASCII string to gzopen. If libiconv fails, handle the error and prompt the user to rename the file.

For more information regarding iconv, see An example of iconv. That example converts Japanese text to UTF-8, but it wouldn't be a large leap to change the destination encoding to ASCII or ISO 8859-1.
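
A minimal sketch of option 3, assuming the path is already available as UTF-8 and that the iconv implementation knows the charset names "ASCII" and "UTF-8" (glibc does). The function name utf8_path_to_ascii is made up for the example.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a UTF-8 path to plain ASCII in `out`
   (capacity `cap` bytes, terminator included).
   Returns 0 on success, -1 if the path is not pure ASCII, does not fit,
   or the converter is unavailable. */
int utf8_path_to_ascii(const char *utf8, char *out, size_t cap)
{
    iconv_t cd;
    char *in = (char *)utf8;    /* iconv() takes char**, but only reads the input */
    size_t in_left, out_left, rc;

    assert(utf8 != NULL && out != NULL);
    if (cap == 0)
        return -1;

    in_left = strlen(utf8);
    out_left = cap - 1;         /* reserve one byte for the terminator */

    cd = iconv_open("ASCII", "UTF-8");      /* arguments: to, from */
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);

    /* A nonzero result is either an error ((size_t)-1) or a count of
       characters that were replaced; both make the path unusable. */
    if (rc != 0 || in_left != 0)
        return -1;

    *out = '\0';                /* iconv() advanced `out` past the converted bytes */
    return 0;
}
```

On a return value of -1 the caller would report the error and ask the user to rename the file, as described in option 3.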

For more information regarding zlib and non ANSI character conversion see here

Appleman1234
  • Can anybody please help to decipher what the [*here*](http://board.zsnes.com/phpBB3/viewtopic.php?f=22&t=12061) link pointed to? – Wolf Aug 11 '21 at 09:35
  • The page doesn't seem to be archived on Archive.org, I believe it was a post on the old bsnes development forums, elaborating more on the work done under the line item - zlib modified to support non-ANSI characters from the bsnes changelog https://static.hexostum.net/bsnes/bsnes_changelog.txt – Appleman1234 Aug 12 '21 at 00:33
3

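Here is an implementation of Appleman's option #2. The descriptor is duplicated and the FILE* is closed before gzdopen() is called, so that gzclose() does not close a descriptor that a FILE* still owns.

#ifdef _WIN32

#include <io.h>
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

gzFile _wgzopen(const wchar_t* fileName, const wchar_t* mode)
{
    FILE* stream = NULL;
    gzFile gzstream = NULL;
    char* cmode = NULL;         // mode converted to char*
    size_t n = (size_t)-1;
    int fd = -1;

    stream = _wfopen(fileName, mode);
    if(!stream)
        return NULL;

    n = wcstombs(NULL, mode, 0);
    if(n != (size_t)-1)
        cmode = (char*)malloc(n + 1);
    if(cmode) {
        wcstombs(cmode, mode, n + 1);
        fd = _dup(_fileno(stream));     // gzclose() will close this copy
    }
    fclose(stream);                     // releases the FILE* and its own descriptor

    if(fd != -1) {
        gzstream = gzdopen(fd, cmode);
        if(!gzstream) _close(fd);
    }
    free(cmode);
    return gzstream;
}

#endif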

I have made both filename and mode const wchar_t* for consistency with Windows functions such as

FILE* _wfopen(const wchar_t* filename, const wchar_t* mode);
Johan Råde
  • Tested this with a Visual Studio 2010 build; in debug you get an exception when the application is about to terminate. This probably happens because the file was opened using _wfopen, but the handle is later closed by _close. You might get a "safe" implementation by duplicating the fileno and then closing the file, but I've attached my own implementation of this function below. – TarmoPikaro Oct 26 '15 at 20:29
  • There is some oddness when this is executed in an application. The application terminates OK, but in the tray I noticed that a bug report was being sent to Microsoft automatically. This is the first time I've seen a "silent" bug report; nothing is displayed to the end user. – TarmoPikaro Oct 26 '15 at 20:29
  • Debugged in VS2012: after the application terminates, an exception is displayed. You try to terminate the application, but the debugger just hangs for half a minute. VS also needs to be restarted after this bug. – TarmoPikaro Oct 26 '15 at 20:30
1

Here is my own version of the Unicode helper function, tested somewhat more thoroughly than the version above.

#include <ctype.h>      // tolower
#include <fcntl.h>      // O_* flags
#include <io.h>         // _wopen, _close
#include <sys/stat.h>   // _S_IREAD, _S_IWRITE
#include <zlib.h>

// Translates an fopen-style mode string into _wopen() flags and permission bits.
static void GetFlags(const char* mode, int& flags, int& pmode)
{
    const char* _mode = mode;

    flags = O_BINARY;   // O_RDONLY is 0. Always binary: text mode would translate bytes of the compressed stream.
    pmode = 0;          // pmode needs to be obtained, otherwise file gets read-only attribute, see 
                        // http://stackoverflow.com/questions/1412625/why-is-the-read-only-attribute-set-sometimes-for-files-created-by-my-service

    for( ; *_mode ; _mode++ )
    {
        switch( tolower((unsigned char)*_mode) )
        {
            case 'w':
                flags |= O_CREAT | O_TRUNC;
                pmode |= _S_IWRITE;
                break;
            case 'a':
                flags |= O_CREAT | O_APPEND;
                pmode |= _S_IREAD | _S_IWRITE;
                break;
            case 'r':
                pmode |= _S_IREAD;
                break;
            case 'b':
                break;  // binary mode is already set unconditionally
            case '+':
                flags |= O_RDWR;
                pmode |= _S_IREAD | _S_IWRITE;
                break;
        }
    }

    if( (flags & O_CREAT) != 0 && (flags & O_RDWR) == 0 )
        flags |= O_WRONLY;
} //GetFlags


gzFile wgzopen(const wchar_t* fileName, const char* mode)
{
    gzFile gzstream = NULL;
    int f = 0;
    int flags = 0;
    int pmode = 0;

    GetFlags(mode, flags, pmode);

    f = _wopen(fileName, flags, pmode );

    if( f == -1 )
        return NULL;

    // On success the gzFile owns the descriptor: gzclose() will also close it.
    gzstream = gzdopen(f, mode);
    if(!gzstream)
        _close(f);
    return gzstream;
}
TarmoPikaro
  • Mark has added support for wide character file names on Windows to recent versions of zlib. – Johan Råde Oct 28 '15 at 19:22
  • I agree. But if you have zlib library already integrated and you don't want to bother with re-integration of new library (backwards compatibility, new features, etc...) - then it's easier just to wrap existing. – TarmoPikaro Oct 28 '15 at 19:53
  • You have to pass `O_BINARY` to `_wopen` in any case, not only if the uncompressed data is binary; otherwise text-mode translation will corrupt any 0x0A byte in the compressed output! – Iziminza Aug 22 '21 at 14:16