
I'm working on a library (pugixml) that, among other things, provides a file load/save API for XML documents using narrow-character C strings:

bool load_file(const char* path);
bool save_file(const char* path);

Currently the path is passed verbatim to fopen, which means that on Linux/OSX you can pass a UTF-8 string to open the file (or any other byte sequence that is a valid path), but on Windows you have to use Windows ANSI encoding - UTF-8 won't work.

The document data is (by default) represented using UTF-8, so if you had an XML document that contains a file path, you would not be able to pass the path retrieved from the document to the load_file function as-is - or rather, this would not work on Windows. The library provides alternative functions that use wchar_t:

bool load_file(const wchar_t* path);

But using them requires extra effort to convert the UTF-8 path to wchar_t.
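
For illustration, here is roughly what that extra conversion looks like on Windows. This is a caller-side sketch, not part of pugixml; the helper name load_file_utf8 is made up, and it assumes the caller's paths are already UTF-8:

#include <windows.h>
#include <string>
#include "pugixml.hpp"

// Hypothetical caller-side helper (Windows only): convert a UTF-8 path
// to UTF-16 and call the wchar_t overload of load_file.
bool load_file_utf8(pugi::xml_document& doc, const char* utf8_path)
{
    // First call computes the required buffer size, including the terminator.
    int count = MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, nullptr, 0);
    if (count <= 0) return false;

    std::wstring wide(static_cast<size_t>(count), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, &wide[0], count);

    // xml_parse_result converts to bool (true on success).
    return doc.load_file(wide.c_str());
}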

A different approach (used by SQLite and GDAL - I'm not sure if there are other C/C++ libraries that do this) involves treating the path as UTF-8 on Windows (implemented by converting it to UTF-16 and using a wchar_t-aware function like _wfopen to open the file).
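
A rough sketch of how that second approach could be implemented inside a library (illustrative code only, not actual pugixml, SQLite, or GDAL source), assuming the incoming path is always interpreted as UTF-8:

#include <cstdio>

#ifdef _WIN32
#include <windows.h>
#endif

// Open a file whose path is interpreted as UTF-8 on every platform.
// On Windows the path is converted to UTF-16 and opened with _wfopen;
// elsewhere the bytes are passed straight through to fopen.
FILE* open_file_utf8(const char* path, const char* mode)
{
#ifdef _WIN32
    // Fixed-size buffers keep the sketch short; real code would size them
    // dynamically to support paths longer than MAX_PATH.
    wchar_t wpath[MAX_PATH];
    wchar_t wmode[16];

    if (!MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, MAX_PATH)) return 0;
    if (!MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16)) return 0;

    return _wfopen(wpath, wmode);
#else
    return fopen(path, mode);
#endif
}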

There are different pros and cons that I can see, and I'm not sure which tradeoff is best.

On one hand, using a consistent encoding on all platforms is definitely good. It would mean that you can use file paths extracted from an XML document to open other XML documents. Also, if the application that uses the library adopts UTF-8, it does not have to do extra conversions when opening XML files through the library.

On the other hand, this means that the behavior of file loading is no longer the same as that of the standard functions - so file access through the library is not equivalent to file access through standard fopen/std::fstream. It seems that while some libraries take the UTF-8 path, it is largely an unpopular choice (is this true?), so for an application that uses many third-party libraries it may increase confusion instead of helping developers.

For example, passing argv[1] into load_file currently works on Windows for paths encoded in the system locale encoding (e.g. with a Russian locale you can load files with Russian names that way, but not files with Japanese characters in their names). Switching to UTF-8 would mean that only ASCII paths work, unless you retrieve the command-line arguments in some other Windows-specific way, as sketched below.
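
That Windows-specific way could look roughly like this (a sketch, not pugixml API): fetch the wide command line, then convert each argument to UTF-8 before handing it to the library.

#include <windows.h>
#include <shellapi.h>   // CommandLineToArgvW (link with Shell32.lib)
#include <string>
#include <vector>

// Sketch: build a UTF-8 argument list on Windows from the wide command
// line, so that paths with characters outside the ANSI code page survive.
std::vector<std::string> get_utf8_args()
{
    int argc = 0;
    wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &argc);

    std::vector<std::string> args;
    for (int i = 0; i < argc; ++i)
    {
        int size = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, nullptr, 0, nullptr, nullptr);
        if (size <= 0) continue;

        std::string arg(static_cast<size_t>(size), '\0');
        WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, &arg[0], size, nullptr, nullptr);
        arg.resize(static_cast<size_t>(size) - 1); // drop the embedded terminator

        args.push_back(arg);
    }

    LocalFree(wargv);
    return args;
}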

And of course this would be a breaking change for some users of the library.

Am I missing any important points here? Are there other libraries that take the same approach? What is better for C++ - being consistently inconsistent in file access, or striving for uniform cross-platform behavior?

Note that the question is about the default way to open files - of course nothing prevents me from adding another pair of functions with a _utf8 suffix or indicating the path encoding in some other way.

zeuxcg
  • Three things: (1) Why not just convert to UTF-16 internally, then use `_wfopen`/`std::ifstream(wchar_t *)` on Windows? The resulting file object is the same as the one opened by the non-`wchar` functions. (2) Have you read http://utf8everywhere.org, and do you agree with it? (3) See http://stackoverflow.com/questions/11107608/whats-wrong-with-c-wchar-t-and-wstrings-what-are-some-alternatives-to-wide. – nneonneo Jun 27 '15 at 19:07
  • 1) that's exactly how the second approach would work 2) I've read it. I agree that UTF-8 is generally superior in a cross-platform application, but a library may be different - does the world think the same? :) – zeuxcg Jun 27 '15 at 19:09
  • Another library that uses UTF-8 (and no wide-chars) on Windows: gtkmm. Though it does other crimes. – Yakov Galka Jun 28 '15 at 10:10
  • It should be noted that Microsoft is pushing the FileSystem TS *hard*, for precisely these reasons. They even implemented support for Boost.FileSystem v2, just so they could have *something* there in VS2013. Now that the FSTS is done, they have the first implementation among major compiler/standard library vendors. They want to see this problem ended just as much as everyone else. – Nicol Bolas Jan 02 '16 at 13:50

1 Answer


There's a growing belief that you should aim for UTF-8 only in cross-platform code, and perform conversions automatically on Windows where appropriate. utf8everywhere.org gives a good rundown of the reasons to prefer UTF-8 encoding.

As a recent example, libtorrent deprecated all the routines that handle wchar_t filenames, and instead asks library users to use their wchar_t-to-UTF-8 conversion functions before passing in filenames.

Personally, the strongest reason I would have to avoid the wchar_t/wstring functions is simply to avoid duplicating my API. Keeping the number of functions in the API down to reduce external maintenance, documentation, and code-path duplication costs is valuable. The details can be worked out internally. The mess of duplicated APIs caused by the Windows ANSI/Unicode split is probably lesson enough to avoid this in your own APIs.

nneonneo
  • Seconded - this is indeed the best possible approach and makes things much simpler. – Artyom Jun 28 '15 at 05:15
  • Yes, more specifically on Windows, **convert the file name to wide characters and use the wide API**. There are so many libraries out there that just pass a `const char*` to `fopen(...)`, which effectively makes it impossible to open files with arbitrary filenames (i.e. with characters outside the current code page). – roeland Jun 29 '15 at 06:11