How does conversion between char and wchar_t work in Windows?

Question

On Windows there are functions such as mbstowcs to convert between char and wchar_t. There are also C++ facilities such as from_bytes used with std::codecvt<wchar_t, char, std::mbstate_t>.

But how does this work behind the scenes, given that char and wchar_t are obviously of different sizes? I assume the system codepage is involved in some way? But what happens if a wchar_t can't be correlated to a char (it can after all contain a lot more values)?

Also what happens if code that has to use char (maybe due to a library) is moved between computers with different codepages? Say that it is only using numbers (0-9) which are well within the range of ASCII, would that always be safe?

And finally, what happens on computers where the local language can't be represented in 256 characters? In that case the concept of char seems completely irrelevant other than for storing for example utf8.

`But what happens if a wchar_t can't be correlated to a char (it can after all contain a lot more values)?` This is exactly the thing to worry about. The real answer is "don't convert to char". If you have wchar_t / UTF16 data and don't want to lose the content, just keep it as it is. (there are, of course, other encodings which can be converted to without losses, but the usual default one-byte-encodings are not among them) — deviantfan, Dec 04 '15 at 11:21
The last parameter of Microsoft's [mbstowcs](https://msdn.microsoft.com/en-us/library/k1f9b8cy.aspx?f=255&MSPPError=-2147217396) is [locale](https://msdn.microsoft.com/en-us/library/wyzd2bce.aspx) which controls how conversion will be performed. [Standard one](http://en.cppreference.com/w/cpp/string/multibyte/mbstowcs) uses [setlocale](http://en.cppreference.com/w/cpp/locale/setlocale). Those are pathological. Internationalization in any robust application should be handled by a dedicated (Unicode) library (icu, Qt, boost, ...) — Ivan Aksamentov - Drop, Dec 04 '15 at 11:21
About the other two problems, again, don't down-convert a unicode encoding to some 256-value-encoding. — deviantfan, Dec 04 '15 at 11:25
Note that just because the Windows API assumes `char`s are in the system codepage, that doesn't mean `char`s always are. Some libraries might assume they are UTF-8, for example, and it is fine to go from `wchar_t` (which is UTF-16 on Windows) to UTF-8. — Simple, Dec 04 '15 at 11:35
@Drop: icu, Qt and boost will boil down to the standard functions, otherwise they will not in themselves be "robust". They are pre-standard implementations used to define what the standard has to be, and that will be implemented through the standard as well – Emilio Garavaglia, Dec 04 '15 at 11:40

score 1 · Accepted Answer · answered Dec 04 '15 at 11:38


It all depends on the codecvt facet used, as described here.

In your case (std::codecvt<wchar_t, char, std::mbstate_t>), it all boils down to mbsrtowcs / wcsrtombs using the global locale (that is the "C" locale, if you don't replace it with the system one).
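As a rough illustration of the point above, the sketch below calls std::wcstombs under the "C" locale. The helper name narrow_in_c_locale is hypothetical and not part of the answer. ASCII-only input such as the digits 0-9 comes back unchanged; what happens to other characters depends on the implementation (the call may report failure), so the sketch returns an empty optional in that case instead of assuming a result.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: narrow a wide string with std::wcstombs under the "C" locale.
// Returns std::nullopt when the library reports a character it cannot represent.
std::optional<std::string> narrow_in_c_locale(const std::wstring& wide) {
    // A program starts in the "C" locale; setting it here makes the assumption explicit.
    // Note that this changes the global C locale for the whole process.
    std::setlocale(LC_ALL, "C");

    // Each wide character needs at most MB_CUR_MAX bytes in the current locale.
    std::string out(wide.size() * MB_CUR_MAX + 1, '\0');
    std::size_t written = std::wcstombs(out.data(), wide.c_str(), out.size());
    if (written == static_cast<std::size_t>(-1)) {
        return std::nullopt;  // some wchar_t had no char representation
    }
    assert(written <= out.size());
    out.resize(written);  // drop the unused tail of the buffer
    return out;
}
```

Calling narrow_in_c_locale(L"0123456789") yields the same ten digits as a narrow string. For text outside ASCII, check whether the optional holds a value before using it.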


Emilio Garavaglia


Interesting, so essentially as long as you don't manually change the global locale it will be the C locale and the code will work on any computer. Still, what happens if I have a wchar_t that isn't mapped in the C locale? – DaedalusAlpha Dec 04 '15 at 12:04
In that case the specification doesn't help. The function may fail (and the specification says how) or an *approximation* may be attempted. In general, if internationalization is required, it is better to use UTF-8 inside the program and convert UTF-8 to UTF-16 when calling the Windows API. The C locale is for English-based programming languages, but no more than that. UTF-8 is so widely accepted that everything else will soon disappear (note that UTF-8 is the same as ASCII in the first 128 code points, which are the same as in the C locale, so formal programs remain the same). – Emilio Garavaglia Dec 04 '15 at 12:52

score 0 · Answer 2 · answered Dec 04 '15 at 11:38

I don't know about mbstowcs() but I assume it is similar to std::codecvt<cT, bT, std::mbstate_t>. The latter deals in terms of two types:

A character type cT which is in your code wchar_t.
A byte type bT which is normally char.

The third type in play, std::mbstate_t, is used to store any intermediate state between calls to the std::codecvt<...> facet. The facets can't have any mutable state, so any state that must survive between calls has to be passed in by the caller. Sadly, the structure of std::mbstate_t is left unspecified, i.e., there is no portable way to actually use it when creating your own code conversion facets.
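To make the role of std::mbstate_t more concrete, here is a minimal sketch that decodes UTF-8 bytes through the standard std::codecvt<char32_t, char, std::mbstate_t> facet. The helper name decode_chunk is hypothetical. The facet is used through a const reference, and the state object is owned by the caller and handed to every call. Whether a given implementation actually stores anything in the state for UTF-8 is not something this sketch relies on. Note also that this facet specialization is deprecated as of C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: decode a chunk of UTF-8 bytes to UTF-32 with the
// standard codecvt facet. The caller owns the std::mbstate_t and passes the
// same object to every call, because the facet itself is immutable.
std::codecvt_base::result decode_chunk(const std::string& bytes,
                                       std::mbstate_t& state,
                                       std::u32string& out) {
    using Facet = std::codecvt<char32_t, char, std::mbstate_t>;
    const Facet& facet = std::use_facet<Facet>(std::locale::classic());

    // Decoding never produces more char32_t values than there are input bytes.
    std::u32string buffer(bytes.size(), U'\0');
    const char* from_next = bytes.data();
    char32_t* to_next = buffer.data();

    std::codecvt_base::result status = facet.in(
        state,
        bytes.data(), bytes.data() + bytes.size(), from_next,
        buffer.data(), buffer.data() + buffer.size(), to_next);

    // Sanity check: the facet must leave the output cursor inside the buffer.
    assert(to_next >= buffer.data() && to_next <= buffer.data() + buffer.size());

    // Keep only the characters the facet actually produced.
    out.append(buffer.data(), static_cast<std::size_t>(to_next - buffer.data()));
    return status;
}
```

A caller would create one std::mbstate_t state{} per input stream and reuse it for each chunk. If a chunk ends in the middle of a multi-byte sequence, the facet is expected to report a status other than ok, and the caller decides how to carry the unconsumed bytes forward.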

Each instance of std::codecvt<...> implements the conversions between bytes of an external encoding, e.g., UTF8, and characters. Originally, each character was meant to be a stand-alone entity, but various reasons (primarily from outside the C++ community, notably from changes made to Unicode) have resulted in the internal characters effectively being an encoding themselves. Typically the internal encodings used are UTF8 for char and UTF16 or UCS4 for wchar_t (depending on whether wchar_t uses 16 or 32 bits).

The decoding conversions done by std::codecvt<...> take the incoming bytes in the external encoding and turn them into characters of the internal encoding. For example, when the external encoding is UTF8, the incoming bytes are converted to 32 bit code points, which are then stored as UTF16 characters by splitting them up into two wchar_t when necessary (e.g., when wchar_t is 16 bit).

The details of this process are unspecified but it will involve some bit masking and shifting. Also, different transformations will use different approaches. If the mapping between the external and internal encoding isn't as trivial as mapping one Unicode representation to another representation there may be suitable tables providing the actual mapping.
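The masking and shifting mentioned above can be sketched by hand for the UTF8 to UTF16 case. This is only an illustration of the arithmetic, not what any particular standard library does internally. The function names are hypothetical, the input is assumed to be well-formed UTF8 (no validation is done), and char16_t stands in for a 16 bit wchar_t so the sketch behaves the same on every platform.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Assumption: the input is well-formed; malformed bytes are not detected.
char32_t decode_utf8(const std::string& s, std::size_t& i) {
    // Read a byte as unsigned; out-of-range reads yield 0 instead of UB.
    auto byte = [&](std::size_t k) -> char32_t {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0u;
    };
    char32_t b0 = byte(i);
    if (b0 < 0x80) {                       // 0xxxxxxx: plain ASCII
        i += 1;
        return b0;
    }
    if ((b0 & 0xE0) == 0xC0) {             // 110xxxxx 10xxxxxx
        char32_t cp = ((b0 & 0x1F) << 6) | (byte(i + 1) & 0x3F);
        i += 2;
        return cp;
    }
    if ((b0 & 0xF0) == 0xE0) {             // 1110xxxx 10xxxxxx 10xxxxxx
        char32_t cp = ((b0 & 0x0F) << 12) | ((byte(i + 1) & 0x3F) << 6) |
                      (byte(i + 2) & 0x3F);
        i += 3;
        return cp;
    }
    // 11110xxx followed by three continuation bytes
    char32_t cp = ((b0 & 0x07) << 18) | ((byte(i + 1) & 0x3F) << 12) |
                  ((byte(i + 2) & 0x3F) << 6) | (byte(i + 3) & 0x3F);
    i += 4;
    return cp;
}

// Append one code point as UTF-16, using a surrogate pair above U+FFFF.
void append_utf16(char32_t cp, std::u16string& out) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
        return;
    }
    cp -= 0x10000;                                                // 20 bits remain
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
}

// Convert a whole UTF-8 string to UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::u16string out;
    for (std::size_t i = 0; i < utf8.size();) {
        append_utf16(decode_utf8(utf8, i), out);
    }
    return out;
}
```

For example, the three bytes E2 82 AC decode to the code point U+20AC, which fits in a single 16 bit unit, while the four bytes F0 9F 98 80 decode to U+1F600 and are split into the surrogate pair D83D DE00.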

score 0 · Answer 3 · edited May 23 '17 at 12:07

If what is in the char array is actually a UTF-8 encoded string, then you can convert it to and from a UTF-16 encoded wchar_t array using

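#include <locale>
#include <codecvt>
#include <string>

// Note: std::wstring_convert and <codecvt> are deprecated since C++17.
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
// UTF-16 wchar_t string -> UTF-8 bytes
std::string narrow = converter.to_bytes(wide_utf16_source_string);
// UTF-8 bytes -> UTF-16 wchar_t string
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);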

as described in more detail at https://stackoverflow.com/a/18597384/6345
