
I'm in the process of fixing a large open source cross-platform application such that it can handle file paths containing non-ANSI characters on Windows.


Update:

Based on answers and comments I got so far (thanks!) I feel like I should clarify some points:

  1. I cannot modify the code of dozens of third party libraries to use wchar_t. This is just not an option. The solution has to work with plain ol' std::fopen(), std::ifstream, etc.

  2. The solution I outline below gets me 99% of the way there, at least on the system I'm developing on (Windows 10 version 1909, build 18363.535). I haven't tested on any other system yet.

    The only remaining issue, at least on my system, is basically number formatting and I'm hopeful that replacing the std::numpunct facet does the trick (but I haven't succeeded yet).


My current solution involves:

  1. Setting the C locale to .UTF-8 for the LC_CTYPE category on Windows (all other categories are set to the C locale as required by the application):

    // Required by the application.
    std::setlocale(LC_ALL, "C");
    
    // On Windows, we want std::fopen() and other functions dealing with strings
    // and file paths to accept narrow-character strings encoded in UTF-8.
    #ifdef _WIN32
    {
        const char* new_ctype_locale = std::setlocale(LC_CTYPE, ".UTF-8");
        assert(new_ctype_locale != nullptr);
        (void)new_ctype_locale; // Avoid an unused-variable warning when NDEBUG is defined.
    }
    #endif
    
  2. Configuring boost::filesystem::path to use the en_US.UTF-8 locale so that it too can deal with paths containing non-ANSI characters:

    boost::filesystem::path::imbue(std::locale("en_US.UTF-8"));
    

The last missing bit is to fix file I/O using C++ streams such as

    std::ifstream istream(filename);

The simplest solution is probably to set the global C++ locale at the beginning of the application:

    std::locale::global(std::locale("en_US.UTF-8"));

However, that messes up the formatting of numbers, e.g. 1234.56 gets formatted as 1,234.56.

Is there a locale that just specifies the encoding to be UTF-8 without messing with number formatting (or other things)?

Basically I'm looking for the C.UTF-8 locale, but that doesn't seem to exist on Windows.

Update: I suppose one solution would be to reset some (most? all?) of the facets of the locale, but I'm having a hard time finding information on how to do that.

François Beaune

2 Answers

3

The Windows API does not respect the CRT locales, and the CRT implementation of fopen etc. calls the narrow-char API directly, so changing the locale will not affect the encoding.

However, the Windows 10 May 2019 Update (version 1903) introduced support for UTF-8 in its narrow-char APIs. It can be enabled by embedding an appropriate manifest into your executable. Unfortunately it's a very recent addition, so it might not be an option if you need to target older systems.

Your other options include converting manually to wchar_t or using a layer that does that for you (like Boost.Filesystem, or even better, Boost.Nowide).

Yakov Galka
  • That's not at all my experience. I seem to be able to open file paths encoded in UTF-8 just fine as long as I set the C locale (for `fopen()`) and C++ locale (for C++ streams) to UTF-8. Am I missing something? – François Beaune Jan 09 '20 at 07:53
  • "Windows API does not respect the CRT locales": Note that at no point did I mention Windows APIs. The app is cross-platform and doesn't (directly) use any platform-specific API. – François Beaune Jan 09 '20 at 07:57
  • It is likely that the C and C++ library features you're using boil down to Windows API functions. How else could they work? – Lightness Races in Orbit Jan 09 '20 at 10:57
  • @FrançoisBeaune: at least prior to Windows 10 MSVCRT's `fopen` called `CreateFileA` directly, consequently it could not support UTF-8. If your report of `fopen` respecting the locale is indeed true, then it must be a new behavior they introduced recently that I'm not aware of. Regardless of that, my answer still holds -- instead of tweaking the locale for each library (std, boost, C) independently, which breaks your formatting, you should leave the locales as-is and embed a manifest declaring UTF-8 codepage as described above. If you can rely on the new `fopen` behavior then you can rely on – Yakov Galka Jan 09 '20 at 17:41
  • that feature too. Note also that you shouldn't change locales because you should respect the user's settings. On the other hand you cannot rely on numbers being formatted in a specific way either, because with other locales they may be formatted differently (e.g. use comma as a fractional separator). – Yakov Galka Jan 09 '20 at 17:42
  • @ybungalobill Thanks for the details. Isn't the underlying implementation of `fopen()` bound to the selected VS' toolset rather than Windows version? – François Beaune Jan 10 '20 at 11:01
  • I wasn't aware of the manifest solution. It seems to work nicely. I'm not quite sure I understand the behavior on older versions of Windows 10 (< 1903) and older versions of Windows. The page you linked to says: `You can declare this property and target/run on earlier Windows builds, but you must handle legacy code page detection and conversion as usual.` What do they mean by that? – François Beaune Jan 12 '20 at 11:48
  • @FrançoisBeaune: 1) starting with Visual Studio 2015 there is a Universal CRT that's common to all Visual Studio versions. This is how it worked prior to Visual Studio .Net (2003), and it's great to see that they went back to that model. 2) on older Windows versions the narrow char interfaces will ignore the manifest and assume the legacy encoding. They want you to convert and call the wide interface instead. You say you cannot modify the 3rd party libraries, but then there's no way to support Unicode with those on older Windows versions. – Yakov Galka Jan 13 '20 at 21:46
  • Thanks for the clarification @ybungalobill. I went with the manifest solution as it requires essentially zero modification in the code. We will only support non-ANSI chars in file paths starting with Windows 10 version 1903 and just ignore older versions of Windows... – François Beaune Jan 14 '20 at 14:21

Never mind locales.

On Windows you should use Microsoft's extension that adds a constructor taking const wchar_t* (expected to point to UTF-16) to std::ifstream.

Hopefully all your strings are UTF-8, or otherwise some consistent and sane encoding.

So just grab a UTF-8 → UTF-16 converter (they're lightweight) and pass filenames to std::ifstream as UTF-16 (in a wchar_t*).

(Be sure to #ifdef it out so it doesn't get attempted on any other platform.)

You should also use _wfopen instead of std::fopen, in the same way, for the same reason.

That's it.

Lightness Races in Orbit
  • Yeah, Win32 and MSVCRT don't really let you do anything useful with UTF-8 text other than convert it to and from Windows' native UTF-16 encoding, so you're pretty much forced to use wide characters for stuff like filenames. – dan04 Jan 08 '20 at 23:54
  • As of Windows 10 May 2019 update, [UTF-8 can be set as the process codepage](https://learn.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page), so one can use the narrow char API with UTF-8. – Yakov Galka Jan 09 '20 at 01:51
  • @ybungalobill Didn't say it's impossible. Doesn't make it the best solution. – Lightness Races in Orbit Jan 09 '20 at 01:52
  • It was a reply to @dan04 :) – Yakov Galka Jan 09 '20 at 02:02
  • @ybungalobill Oh. :) – Lightness Races in Orbit Jan 09 '20 at 02:08
  • I unfortunately _cannot_ modify the code of the dozen third party libraries that this app builds on so I'll have to make `fopen()` and C++ streams work with UTF-8. As far as I can tell, my solution works perfectly (granted, I only tested on my computer so far), if it weren't for messed up number formatting. – François Beaune Jan 09 '20 at 07:54
  • @ybungalobill: Glad that Microsoft *finally* made this change. It's been such a PITA to write cross-platform code all these years having to manually do all the string conversions on Windows. – dan04 Jan 09 '20 at 14:26