I'm working on a library (pugixml) that, among other things, provides a file load/save API for XML documents using narrow-character C strings:
bool load_file(const char* path);
bool save_file(const char* path);
Currently the path is passed verbatim to fopen, which means that on Linux/OS X you can pass a UTF-8 string (or any other byte sequence that forms a valid path) to open the file, but on Windows you have to use the Windows ANSI code page - UTF-8 won't work.
The document data is (by default) represented in UTF-8, so if an XML document contained a file path, you could not pass the path retrieved from the document to the load_file function as-is - or rather, this would not work on Windows. The library provides alternative functions that use wchar_t:
bool load_file(const wchar_t* path);
But using them requires extra effort to convert the UTF-8 path to wchar_t first.
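For concreteness, this is roughly the conversion a caller has to do today - a minimal sketch using the Win32 MultiByteToWideChar call; widen_utf8 is an illustrative helper name, not part of the library, and error handling is kept minimal:

#include <windows.h>
#include <string>

// Convert a UTF-8 string to a wide string suitable for the wchar_t overload.
std::wstring widen_utf8(const char* utf8)
{
    // First call computes the required buffer size in wchar_t units,
    // including the null terminator.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &wide[0], len);
    wide.resize(len - 1); // drop the embedded null terminator
    return wide;
}

// Usage: load_file(widen_utf8(path_from_document).c_str());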
A different approach (used by SQLite and GDAL - I'm not sure whether other C/C++ libraries do this) involves treating the path as UTF-8 on Windows, which would be implemented by converting it to UTF-16 and using a wchar_t-aware function like _wfopen to open the file.
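Sketched in code, that approach could look something like this (open_file_utf8 is a hypothetical internal helper; the fixed-size buffers and minimal error handling are for brevity):

#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Open a file whose path is always interpreted as UTF-8, on every platform.
FILE* open_file_utf8(const char* path, const char* mode)
{
#ifdef _WIN32
    // Convert both path and mode from UTF-8 to UTF-16, then use _wfopen.
    wchar_t wpath[MAX_PATH];
    wchar_t wmode[16];
    if (!MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, MAX_PATH)) return nullptr;
    if (!MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16)) return nullptr;
    return _wfopen(wpath, wmode);
#else
    // POSIX systems treat the path as an opaque byte sequence already.
    return fopen(path, mode);
#endif
}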
I can see different pros and cons to each approach, and I'm not sure which tradeoff is best.
On one hand, using a consistent encoding on all platforms is definitely good. It would mean that you can use file paths extracted from an XML document to open other XML documents. Also, if an application that uses the library adopts UTF-8 throughout, it does not have to do extra conversions when opening XML files through the library.
On the other hand, this means that the behavior of file loading is no longer the same as that of the standard functions - file access through the library would not be equivalent to file access through standard fopen/std::fstream. It seems that while some libraries take the UTF-8 route, it is largely an unpopular choice (is this true?), so for an application that uses many third-party libraries it may increase confusion instead of helping developers.
For example, passing argv[1] into load_file currently works for paths encoded in the system locale encoding on Windows (e.g. with a Russian locale you can load files with Russian names this way, but not files with Japanese characters in their names). Switching to UTF-8 would mean that only ASCII paths work, unless you retrieve the command-line arguments in some other Windows-specific way, as sketched below.
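That Windows-specific workaround would look roughly like this - a sketch that fetches the command line as UTF-16 via CommandLineToArgvW and converts it to UTF-8 itself (first_argument_utf8 is an illustrative name; this needs linking against Shell32):

#include <windows.h>
#include <shellapi.h>
#include <string>

// Retrieve the first command-line argument as UTF-8, bypassing the
// locale-encoded argv that main() receives.
std::string first_argument_utf8()
{
    int argc = 0;
    wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
    std::string result;
    if (wargv && argc > 1)
    {
        int len = WideCharToMultiByte(CP_UTF8, 0, wargv[1], -1, nullptr, 0, nullptr, nullptr);
        if (len > 0)
        {
            result.resize(len);
            WideCharToMultiByte(CP_UTF8, 0, wargv[1], -1, &result[0], len, nullptr, nullptr);
            result.resize(len - 1); // drop the embedded null terminator
        }
    }
    if (wargv) LocalFree(wargv);
    return result;
}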
And of course this would be a breaking change for some users of the library.
Am I missing any important points here? Are there other libraries that take the same approach? What is better for C++ - being consistently inconsistent in file access, or striving for uniform cross-platform behavior?
Note that the question is about the default way to open files - of course nothing prevents me from adding another pair of functions with a _utf8 suffix, or indicating the path encoding in some other way.
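For completeness, such an opt-in variant could be a thin wrapper over the overloads declared above (load_file_utf8 is a hypothetical name, reusing the widen_utf8 helper sketched earlier):

// Always interpret the narrow path as UTF-8, regardless of platform.
bool load_file_utf8(const char* path)
{
#ifdef _WIN32
    return load_file(widen_utf8(path).c_str()); // forward to the wchar_t overload
#else
    return load_file(path); // fopen already takes the bytes verbatim
#endif
}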