I'm contributing to a C library. It has a function that takes a char* parameter for a file path name. The authors are mostly UNIX developers, and this works fine on unixes where char* mostly means UTF-8. (At least in GCC, the character set is configurable and UTF-8 is the default.)

However, on Windows char* means text in the current ANSI code page, which implies that it is currently impossible to use Unicode path names with this library on Windows, where wchar_t* should be used and only UTF-16 is supported. (A quick search on StackOverflow reveals that the ANSI Windows API functions cannot be used with UTF-8.)

The question is, what is the right way to deal with this? We've come up with various ways to do it, but none of us is a Windows expert, so we can't really decide how to do it properly. Our goal is that users of the library should be able to write cross-platform code that works on unixes as well as on Windows.

Under the hood, the library has #ifdefs in place to differentiate between operating systems so that it can use POSIX functions on UNIXes and Win32 APIs on Windows.

So far, we've come up with the following possibilities:

  1. Offer a separate Windows-only function that accepts a wchar_t*.
  2. Require UTF-16 on Windows and #ifdef the library header in such a way that the function would accept wchar_t* on Windows.
  3. Add a flag that would tell the function to cast the given char* to wchar_t* and call the widechar Windows APIs.
  4. Create a variant of the function that takes a file descriptor (or file handle on Windows) instead of a file path.
  5. Always require UTF-8 (even on Windows), and then inside the function, convert UTF-8 to UTF-16 and call the widechar Windows APIs.

The problem with options 1-4 is that they would require the user to consciously take care of portability themselves. Option 5 sounds good, but I'm not sure if this is the right way to go.

I'm also open to other suggestions or ideas that can solve this. :)

Venemo
  • 2
    I'm for #5, using CP65001. Admittedly, you are better off doing the transcoding manually on the WINAPI-boundary, as MS support is broken by intent, but that's not hard. – Deduplicator Jan 30 '15 at 17:16
  • 2
    @Deduplicator Relevant: http://www.nubaria.com/en/blog/?p=289 "The reason for this is that the standard non-wide functions and classes that deal with the file system, such as fopen and std::fstream, assume that the char-based string is encoded using the local system’s code page, and there is no way to make them work with UTF-8 strings. Note that at this moment (2011), Microsoft does not allow to set a UTF-8 locale using the set_locale C function." – Karol S Jan 30 '15 at 17:20
  • 3
    You might want to read http://utf8everywhere.org/#how (which is basically #5). – cremno Jan 30 '15 at 17:24
  • 1
    If your library requires UTF-8 paths, it requires UTF-8 paths. Document it. Then on Windows specifically, you'd likely be forced to convert the path to `wchar_t *` using `MultiByteToWideChar`, use `CreateFileW` to open the file, `_open_osfhandle` to associate a POSIX file descriptor with the returned `HANDLE`, and use `_fdopen` to get a `FILE *` from the newly opened file descriptor. I'm not sure you have much choice there, and you might just avoid the headache altogether and ask the user to handle it by requiring an open `FILE *` parameter. Inconvenient, but very portable. –  Jan 31 '15 at 12:36
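For illustration, the chain described in the last comment (which is also what option 5 amounts to) could be sketched roughly as follows. The function name `lib_fopen_utf8_read` is hypothetical and not part of any real library; the sketch assumes read-only access and that the caller documents paths as UTF-8. Only the Windows branch does any transcoding; elsewhere the bytes are passed through unchanged.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#include <io.h>
#include <fcntl.h>
#endif

/* Hypothetical helper: open a file for reading, given a UTF-8 path. */
FILE *lib_fopen_utf8_read(const char *utf8_path)
{
#ifdef _WIN32
    /* First call: ask for the required length in wchar_t units,
       including the terminating null (because the input length is -1). */
    int wlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                   utf8_path, -1, NULL, 0);
    if (wlen <= 0)
        return NULL;

    wchar_t *wpath = malloc((size_t)wlen * sizeof *wpath);
    if (wpath == NULL)
        return NULL;

    /* Second call: perform the UTF-8 to UTF-16 conversion. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8_path, -1, wpath, wlen) == 0) {
        free(wpath);
        return NULL;
    }

    HANDLE h = CreateFileW(wpath, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    free(wpath);
    if (h == INVALID_HANDLE_VALUE)
        return NULL;

    /* Wrap the HANDLE in a CRT file descriptor, then in a FILE *. */
    int fd = _open_osfhandle((intptr_t)h, _O_RDONLY);
    if (fd == -1) {
        CloseHandle(h);
        return NULL;
    }

    FILE *fp = _fdopen(fd, "rb");
    if (fp == NULL)
        _close(fd); /* also closes the underlying HANDLE */
    return fp;
#else
    /* On POSIX systems the path bytes go to the OS as they are. */
    return fopen(utf8_path, "rb");
#endif
}
```

If only a `FILE *` is needed, converting the path and calling `_wfopen` would probably be a shorter route on Windows; the longer chain is shown here because it is the one the comment describes.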

1 Answer

Since portability is an important goal for you, I think it is imperative for your function semantics to be precisely defined. Among other things, that means that the arguments' types and meanings don't vary across platforms. So, if you have a function that accepts regular char-based paths, then it should accept such paths on all systems, and the encoding expected of those paths should be well-defined (which does not necessarily mean "the same"). That rules out options (2) and (3).

Moreover, portability requires the same functions to be usable across all platforms; that rules out (1). Option (4) could be OK if a stream- and/or file-descriptor-based approach were the only one provided by your library, but it yields portability only with respect to those functions, not with respect to the path-based ones. (And note that stream (FILE *) APIs are defined by C, whereas file descriptors are a POSIX concept, not native to C. In principle, therefore, streams are more portable than file descriptors.)

(5) could work, but it places stronger constraints than you actually need. It is not essential for the function to define the encoding expected (though it can); it suffices for it to define how that encoding is determined.
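One way to read "define how that encoding is determined" is a documented, library-wide setting that says how `char`-based paths are interpreted. The sketch below is only an illustration of that idea: every name in it (`lib_path_encoding`, `lib_set_path_encoding`, `lib_get_path_encoding`, `lib_path_code_page`) is made up, and a real library might equally tie the rule to the locale or to a per-call argument.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical setting: how char-based paths are to be interpreted. */
typedef enum {
    LIB_PATH_ENCODING_NATIVE, /* platform default: ANSI code page on Windows,
                                 bytes passed through unchanged elsewhere */
    LIB_PATH_ENCODING_UTF8    /* caller promises UTF-8 on every platform */
} lib_path_encoding;

/* Not thread-safe; a real library would need to decide on that. */
static lib_path_encoding lib_current_path_encoding = LIB_PATH_ENCODING_NATIVE;

void lib_set_path_encoding(lib_path_encoding enc)
{
    lib_current_path_encoding = enc;
}

lib_path_encoding lib_get_path_encoding(void)
{
    return lib_current_path_encoding;
}

#ifdef _WIN32
/* Code page to hand to MultiByteToWideChar before calling the wide APIs. */
UINT lib_path_code_page(void)
{
    return lib_current_path_encoding == LIB_PATH_ENCODING_UTF8 ? CP_UTF8
                                                               : CP_ACP;
}
#endif
```

With something like this, the function signatures and their documented meaning stay the same everywhere, and a caller who wants option (5) behavior simply selects UTF-8 once at startup.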

Additionally, you could add wchar_t-based functions that work everywhere (as opposed to Windows-only). Those might be more convenient for Windows users. Similar to alternative (4), however, that provides portability only with respect to those functions. Supposing that you don't want to drop the char-based ones, you would need to pair this alternative with some variation on (5).

John Bollinger
  • We don't want to drop the `char`-based APIs and would prefer to choose a solution that requires little or no duplication in the code, so I wouldn't want to implement a suite of `wchar_t`-based functions. As far as I know, the POSIX APIs used by the library wouldn't even work with `wchar_t`. – Venemo Jan 30 '15 at 17:44
  • 2
    If you were to provide separate `wchar_t`-based functions then they might simply serve as wrappers that perform encoding translation before calling the `char`-based functions. But that's all a convenience feature, not the essential portability solution. – John Bollinger Jan 30 '15 at 17:48
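A minimal sketch of such a wrapper, under some assumptions: `lib_process_path` is a stand-in for the library's existing `char`-based function (here it only records the path it was given), `lib_process_path_w` is a hypothetical name, and the `char`-based function is assumed to expect UTF-8 on Windows and the locale's multibyte encoding elsewhere.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Stand-in for the existing char-based function (hypothetical).
   It only remembers the path so the wrapper's effect can be observed. */
static char lib_last_path[260];

static int lib_process_path(const char *path)
{
    if (strlen(path) >= sizeof lib_last_path)
        return -1;
    strcpy(lib_last_path, path);
    return 0;
}

/* wchar_t convenience wrapper: translate the encoding, then delegate. */
int lib_process_path_w(const wchar_t *wpath)
{
    char *path;
    int result;

#ifdef _WIN32
    /* UTF-16 to UTF-8; the returned length includes the terminating null. */
    int len = WideCharToMultiByte(CP_UTF8, 0, wpath, -1, NULL, 0, NULL, NULL);
    if (len <= 0)
        return -1;
    path = malloc((size_t)len);
    if (path == NULL)
        return -1;
    if (WideCharToMultiByte(CP_UTF8, 0, wpath, -1, path, len,
                            NULL, NULL) == 0) {
        free(path);
        return -1;
    }
#else
    /* POSIX: with a NULL destination, wcstombs reports the byte count
       needed (excluding the terminator) in the current locale's encoding. */
    size_t len = wcstombs(NULL, wpath, 0);
    if (len == (size_t)-1)
        return -1;
    path = malloc(len + 1);
    if (path == NULL)
        return -1;
    wcstombs(path, wpath, len + 1);
#endif

    result = lib_process_path(path);
    free(path);
    return result;
}
```

All the real work stays in the `char`-based function, so the wrapper adds no duplicated logic beyond the conversion itself. Note that on POSIX the result depends on the locale in effect (`setlocale`); in the default "C" locale only ASCII paths would convert.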