
The task at hand

I'm parsing a filename from a UTF-8 encoded XML file on Windows. I need to pass that filename on to a function that I can't change. Internally it uses _fsopen(), which does not support Unicode strings.

Current approach

My current approach is to convert the filename to the user's charset, hoping that the filename is representable in that encoding. I use boost::locale::conv::from_utf() to convert from UTF-8 and boost::locale::util::get_system_locale() to get the name of the current locale.

Life is good?

I'm on a German system using code page Windows-1252, so get_system_locale() correctly yields de_DE.windows-1252. If I test the approach with a filename containing an umlaut, everything works as expected.

The Problem

Just to make sure, I switched my system locale to Ukrainian, which uses code page Windows-1251. With a Cyrillic letter in the filename, my approach fails. The reason is that get_system_locale() still yields de_DE.windows-1252, which is now incorrect.

On the other hand, GetACP() correctly yields 1252 for the German locale and 1251 for the Ukrainian locale. I also know that Boost.Locale can convert to a given charset, as this small test program works as I expect:

#include <boost/locale.hpp>
#include <iostream>
#include <string>
#include <windows.h>

int main()
{
    std::cout << "Codepage: " << GetACP() << std::endl;
    std::cout << "Boost.Locale: " << boost::locale::util::get_system_locale() << std::endl;

    namespace blc = boost::locale::conv;
    // Cyrillic small letter zhe -> \xe6 (ж on 1251, æ on 1252)
    std::string const test1251 = blc::from_utf(std::string("\xd0\xb6"), "windows-1251");
    std::cout << "1251: " << static_cast<int>(test1251.front()) << std::endl;
    // Latin small letter sharp s -> \xdf (Я on 1251, ß on 1252)
    auto const test1252 = blc::from_utf(std::string("\xc3\x9f"), "windows-1252");
    std::cout << "1252: " << static_cast<int>(test1252.front()) << std::endl;

}
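
Since GetACP() reports the correct code page and from_utf() accepts charset names such as "windows-1251", one possible workaround is to build the charset name from the code page number instead of asking get_system_locale(). The following is only a sketch: charset_name_from_codepage is a hypothetical helper, and it deliberately covers just the Windows-125x family, because only "windows-1251" and "windows-1252" are confirmed to work by the test program above. Names for other code pages would have to be checked against the Boost.Locale backend in use.

```cpp
#include <string>

// Hypothetical helper (not part of Boost.Locale or the Windows API):
// maps a Windows ANSI code page ID, e.g. the value returned by GetACP(),
// to a charset name of the form "windows-125x".
// Returns an empty string for code pages outside 1250..1258, because the
// naming of those is an unverified assumption.
std::string charset_name_from_codepage(unsigned codepage)
{
    if (codepage < 1250 || codepage > 1258)
        return std::string();
    return "windows-" + std::to_string(codepage);
}
```

On Windows the result could then be passed on as blc::from_utf(utf8_name, charset_name_from_codepage(GetACP())), after checking that the returned name is not empty.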

Questions

  • How can I query the name of the user locale in a format Boost.Locale supports? std::locale("").name() yields German_Germany.1252, but passing that name results in a boost::locale::conv::invalid_charset_error exception.

  • Is it possible that the system locale remains de_DE.windows-1252 although I'm supposedly changing it as local admin? Similarly, the system language is German although my account's language is English. (The login screen is German until I log in.)

  • Should I stick with using short filenames? That does not seem to work reliably, though.

Fine-print

  • Compiler is MSVC18
  • Boost is version 1.56.0, backend supposedly winapi
  • System is Win7, system language is German, user language English
Brandlingo

2 Answers


ANSI is deprecated so don't bother with it.

Windows uses UTF-16 natively, so convert the name from UTF-8 to UTF-16 using MultiByteToWideChar. This conversion is lossless for valid UTF-8 input.

std::wstring getU16(const std::string &str)
{
    if (str.empty()) return std::wstring();
    int sz = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), 0, 0);
    std::wstring res(sz, 0);
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &res[0], sz);
    return res;
}

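You then use _wfsopen (the wide-character counterpart of _fsopen) to open the file by its UTF-16 name. The example below uses _wfopen_s, which takes the wide file name in the same way.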

int main()
{
    //UTF8 source:
    std::string filename_u8;

    //This line works in VS2015 only
    //For older version comment out the next line, obtain UTF8 from another source
    filename_u8 = u8"c:\\test\\__ελληνικά.txt";

    //convert to UTF16
    std::wstring filename_utf16 = getU16(filename_u8);

    FILE *file = NULL;
    _wfopen_s(&file, filename_utf16.c_str(), L"w");
    if (file)
    {
        //Add BOM, optional...

        //Write the file name in to file, for testing...
        fwrite(filename_u8.data(), 1, filename_u8.length(), file);

        fclose(file);
    }
    else
    {
        std::cout << "access denied, or folder doesn't exist...\n";
    }

    return 0;
}


Edit: getting ANSI from UTF-8, using the code page returned by GetACP():

std::wstring string_to_wstring(const std::string &str, int codepage)
{
    if (str.empty()) return std::wstring();
    int sz = MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), 0, 0);
    std::wstring res(sz, 0);
    MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), &res[0], sz);
    return res;
}

std::string wstring_to_string(const std::wstring &wstr, int codepage)
{
    if (wstr.empty()) return std::string();
    int sz = WideCharToMultiByte(codepage, 0, &wstr[0], (int)wstr.size(), 0, 0, 0, 0);
    std::string res(sz, 0);
    WideCharToMultiByte(codepage, 0, &wstr[0], (int)wstr.size(), &res[0], sz, 0, 0);
    return res;
}

std::string get_ansi_from_utf8(const std::string &utf8, int codepage)
{
    std::wstring utf16 = string_to_wstring(utf8, CP_UTF8);
    std::string ansi = wstring_to_string(utf16, codepage);
    return ansi;
}
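
Note that the step to ANSI is lossy whenever the file name contains characters the target code page cannot represent. A variation on wstring_to_string can report this through the lpUsedDefaultChar output parameter of WideCharToMultiByte. This is only a sketch: the name wstring_to_string_checked is made up for illustration, and it assumes codepage is an ANSI code page such as the one returned by GetACP() (not CP_UTF8, for which WideCharToMultiByte requires that parameter to be NULL).

```cpp
// Windows-only sketch; the guard just keeps other platforms from compiling it.
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical helper: converts UTF-16 to the given ANSI code page.
// Sets `lossy` to true if the call failed or if at least one character
// had to be replaced by the code page's default character.
std::string wstring_to_string_checked(const std::wstring &wstr, UINT codepage, bool &lossy)
{
    lossy = false;
    if (wstr.empty()) return std::string();

    // WC_NO_BEST_FIT_CHARS prevents silent substitution by look-alike characters.
    int sz = WideCharToMultiByte(codepage, WC_NO_BEST_FIT_CHARS,
                                 wstr.data(), (int)wstr.size(), NULL, 0, NULL, NULL);
    if (sz <= 0) { lossy = true; return std::string(); }

    std::string res(sz, '\0');
    BOOL used_default = FALSE;
    int written = WideCharToMultiByte(codepage, WC_NO_BEST_FIT_CHARS,
                                      wstr.data(), (int)wstr.size(),
                                      &res[0], sz, NULL, &used_default);
    if (written <= 0) { lossy = true; return std::string(); }

    lossy = (used_default != FALSE);
    return res;
}
#endif
```

If lossy comes back true, the name cannot be passed safely to an ANSI-only function under that code page, and the caller would need a different strategy, such as the wide-character route shown first.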
Barmak Shemirani
  • Under the assumption that I cannot change the function this sadly does not help. But this is still promising as the interface is too stable for change but the function only requires effort. Thanks, I'll give it a try. – Brandlingo Jun 30 '16 at 06:24
  • The problem you describe is rather complicated, it's part of the reason Unicode was invented in the first place. I added a function to get ANSI from UTF8, it's sort of what Iverelo suggested. See also this [link](http://stackoverflow.com/a/30229657/4603670) about system language, I am not sure if that helps though. – Barmak Shemirani Jul 02 '16 at 05:50
  • Fortunately I managed to adopt your first suggestion. It allowed me to keep the stable interface but change the internals. I used `boost::locale::conv::utf_to_utf()` instead of your `getU16()` though. – Brandlingo Jul 04 '16 at 05:59
  • Unfortunately there still is no answer to the actual question. But you now provide a way to work around the boost limitation so I'll accept this answer. – Brandlingo Jul 04 '16 at 06:01

Barmak's way is the best way to do it.

To clear up the locale question: a process always starts with the "C" locale. You can use the setlocale function to switch to the user's default locale or to any other named locale.

#include <clocale>

// Query the current locale (returns its name without changing it)
const char *current = setlocale(LC_ALL, NULL);

// Set locale to system default
setlocale(LC_ALL,"");

// Set locale to German
setlocale(LC_ALL,"de-DE");
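
As a rough sketch of how these calls could be wrapped, the two helpers below return the locale name as a std::string. Both function names are hypothetical. The exact format of the returned name is implementation-specific; with the Microsoft runtime it should resemble the German_Germany.1252 reported in the question.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: returns the name of the current C locale
// without changing it.
std::string current_locale_name()
{
    const char *name = std::setlocale(LC_ALL, nullptr);
    return name ? std::string(name) : std::string();
}

// Hypothetical helper: switches to the user's default locale and returns
// its name, or an empty string if the request could not be honored.
std::string use_user_default_locale()
{
    const char *name = std::setlocale(LC_ALL, "");
    return name ? std::string(name) : std::string();
}
```

Calling current_locale_name() before any other setlocale call should return "C", which matches the start-up behavior described above.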
Iverelo
  • Thanks for your answer. The problem is that the locale overloads of the conversion functions don't work with the standard locales. And the charset-as-string overloads fail with the names of these locales even when stripping the language_territory part. – Brandlingo Jun 30 '16 at 06:18
  • The conversion functions you mention are still the boost functions? Unfortunately I do not have a lot of experience with boost locale functions. The trick I have used to go from one encoding to another in the past on Windows is to use [MultiByteToWideChar](https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx) to get to wide characters and then [WideCharToMultiByte](https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130(v=vs.85).aspx) to get back to a different encoding. – Iverelo Jun 30 '16 at 16:36
  • Yes, still the boost functions. I guess internally they do the same, the key question is how the MS uint code page ID is mapped to boost code page strings. – Brandlingo Jun 30 '16 at 17:35