Goal

Converting a wstring containing ÅåÄäÖöÆæØø into a string in C++.

Environment

C++17, Visual Studio Community 2017, Windows 10 Pro 64-bit

Description

I'm trying to convert a wstring to a string and have implemented the solution suggested at https://stackoverflow.com/a/3999597/1997617

// This is the code I use:
#include <string>
#include <windows.h>  // WideCharToMultiByte, CP_UTF8

// Convert a wide Unicode (UTF-16) string to a UTF-8 string
std::string toString(const std::wstring &wstr)
{
    if (wstr.empty()) return std::string();
    // First call: query how many bytes the UTF-8 result needs
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    // Second call: write the UTF-8 bytes into the preallocated buffer
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}

So far so good.

My problem is that I have to handle Scandinavian letters (ÅåÄäÖöÆæØø) in addition to the English ones. Consider the input wstring below.

L"C:\\Users\\BjornLa\\Å-å-Ä-ä-Ö-ö Æ-æ-Ø-ø\\AEther Adept.jpg"

When returned, it has become...

"C:\\Users\\BjornLa\\Ã…-Ã¥-Ã„-Ã¤-Ã–-Ã¶ Ã†-Ã¦-Ã˜-Ã¸\\AEther Adept.jpg"

... which causes me some trouble.

Question

So I would like to ask a frequently asked question, but with a small addition:

How do I convert a wstring to a string when it contains Scandinavian characters?

Björn Larsson
  • Doesn't Windows use wide characters for file names, so that converting a wide character string to UTF-8 would actually make it stop working? I would usually try to avoid any processing of file names, and just keep the encoding the user has provided, assuming that file selection dialogs and such already use the correct encoding. – Erlkoenig Mar 06 '18 at 13:30
  • Agree with @Erlkoenig, but anyway, when you say it has become "C:\\Users\\BjornLa\\Ã…-Ã¥-Ã„-Ã¤-Ã–-Ã¶ Ã†-Ã¦-Ã˜-Ã¸\\AEther Adept.jpg", are you using a renderer (e.g. a debugger) that knows to use UTF-8? After all, `string` is a sequence of bytes that encode a text value. The renderer shows the bytes as it decodes them and, further, shows them as a string literal, so what you see isn't the "real" value. – Tom Blodget Mar 06 '18 at 23:18
  • @Erlkoenig: Perhaps I could rewrite my application to use string instead of wstring, but that would be a major project, and from what I learnt from googling, one is supposed to use wstring on Windows (for example: https://stackoverflow.com/a/402918/1997617). Unfortunately, I need to pass the letters to a third-party framework which doesn't accept wstring, hence the conversion. Also, even if a rewrite would solve the practical problem, the academic one still remains. :-) – Björn Larsson Mar 07 '18 at 07:54
  • @TomBlodget: I'm using the debugger that comes with Visual Studio Community 2017. – Björn Larsson Mar 07 '18 at 07:54
  • Okay, try writing the converted string to a file and open it with e.g. Notepad++ to see whether the encoding is correct. – Erlkoenig Mar 07 '18 at 08:53
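
The comments above suggest checking the actual bytes rather than trusting the debugger's rendering. A minimal sketch of such a check could look like the following; hexDump is a hypothetical helper introduced here for illustration and is not part of the original code.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not from the original post): formats every byte of a
// narrow string as two uppercase hex digits, separated by single spaces, so
// the encoding can be inspected independently of any viewer's code page.
std::string hexDump(const std::string &bytes)
{
    std::string out;
    char buf[4];  // two hex digits, one space, terminating NUL
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "%02X ", static_cast<unsigned>(b));
        out += buf;
    }
    if (!out.empty()) out.pop_back();  // drop the trailing space
    return out;
}
```

For example, hexDump(toString(L"Å")) should give C3 85 if the conversion produced UTF-8, since U+00C5 is encoded as exactly those two bytes. A viewer that reads them as Windows-1252 displays Ã…, which matches the debugger output quoted in the comments.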

1 Answer

So, I did some additional reading and experimenting based on the comments I received.

Turns out the solution is quite simple: just change CP_UTF8 to CP_ACP!

However... Microsoft suggests that one should actually use CP_UTF8, if you read between the lines of the MSDN documentation for WideCharToMultiByte. The note for CP_ACP reads:

This value can be different on different computers, even on the same network. It can be changed on the same computer, leading to stored data becoming irrecoverably corrupted. This value is only intended for temporary use and permanent storage should use UTF-16 or UTF-8 if possible.

Also, the note for the entire method reads:

The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not.

So even though this CP_ACP solution works fine for my test cases, it remains to be seen whether it is a good solution overall.
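
One way to probe that concern is to let WideCharToMultiByte report whether it had to substitute a character. The following is only a minimal sketch under stated assumptions: toAnsi is a hypothetical name, and the code assumes the active code page is a classic ANSI code page (not UTF-8), for which the WC_NO_BEST_FIT_CHARS flag and the lpUsedDefaultChar output parameter are allowed.

```cpp
#ifdef _WIN32  // Windows-only sketch: relies on WideCharToMultiByte
#include <windows.h>
#include <string>

// Hypothetical helper (name and signature are illustrative, not from the post).
// Converts to the active ANSI code page and sets `lossy` when the conversion
// failed or at least one character was replaced by the default character.
std::string toAnsi(const std::wstring &wstr, bool &lossy)
{
    lossy = false;
    if (wstr.empty()) return std::string();

    const int len = static_cast<int>(wstr.size());

    // First call: ask for the required buffer size in bytes.
    const int needed = WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
                                           wstr.data(), len,
                                           nullptr, 0, nullptr, nullptr);
    if (needed <= 0) { lossy = true; return std::string(); }

    // Second call: convert and record whether a replacement was needed.
    std::string out(static_cast<std::size_t>(needed), '\0');
    BOOL usedDefault = FALSE;
    const int written = WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
                                            wstr.data(), len,
                                            &out[0], needed,
                                            nullptr, &usedDefault);
    if (written <= 0) { lossy = true; return std::string(); }

    out.resize(static_cast<std::size_t>(written));
    lossy = (usedDefault != FALSE);
    return out;
}
#endif
```

If lossy comes back true, the input contained a character that the current ANSI code page cannot represent, so the CP_ACP route would corrupt it on that machine.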

Björn Larsson
  • The documentation is in reference to *data that the application owns* (like your own files). Doing that allows your program to work on any machine and communicate with any other machine. Data that is owned by the *user* (eg, a text file created by Notepad) or that comes from a non-UTF-16 program might be UTF-8 or might be encoded with the ACP. Internally, the file system is UTF-16 but when names are converted to "ANSI" they use ACP. – Peter Torr - MSFT Jul 29 '18 at 23:23