I don't know about mbstowcs() but I assume it is similar to std::codecvt<cT, bT, std::mbstate_t>. The latter works in terms of two types:
- A character type cT, which in your code is wchar_t.
- A byte type bT, which is normally char.
The third type in play, std::mbstate_t, is used to store any intermediate state between calls to the std::codecvt<...> facet. The facets can't have any mutable state of their own, so any state carried between calls has to be passed in from outside. Sadly, the structure of std::mbstate_t is left unspecified, i.e., there is no portable way to actually use it when creating your own code conversion facets.
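To make the three types concrete, here is a minimal sketch of calling the facet directly. The helper name decode is made up for this example, and the locale passed in is assumed to use an encoding in which the input bytes are valid. Note how the std::mbstate_t object is owned by the caller and handed to in() on every call.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a byte string into wide characters using the
// std::codecvt facet of the given locale.
std::wstring decode(const std::string& bytes, const std::locale& loc) {
    using facet_type = std::codecvt<wchar_t, char, std::mbstate_t>;
    if (bytes.empty()) return std::wstring();

    const facet_type& cvt = std::use_facet<facet_type>(loc);

    // The facet itself is immutable; the conversion state lives here.
    // A value-initialized mbstate_t represents the initial conversion state.
    std::mbstate_t state = std::mbstate_t();

    // The facet writes at most one wchar_t per input byte here.
    std::wstring out(bytes.size(), L'\0');
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;

    facet_type::result r = cvt.in(state,
                                  bytes.data(), bytes.data() + bytes.size(), from_next,
                                  &out[0], &out[0] + out.size(), to_next);
    if (r != facet_type::ok) {
        // For simplicity, truncated input (partial) is treated like an error.
        throw std::runtime_error("conversion failed or input was incomplete");
    }
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

For plain ASCII input, decode("abc", std::locale::classic()) should yield L"abc". If the input arrives in chunks, the same state object would be reused across calls so that a multi-byte sequence split between chunks can be completed.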
Each instance of std::codecvt<...> implements the conversions between bytes of an external encoding, e.g., UTF8, and characters. Originally, each character was meant to be a stand-alone entity, but various developments (primarily from outside the C++ community, notably changes made to Unicode) have resulted in the internal characters effectively being an encoding themselves. Typically the internal encodings used are UTF8 for char and UTF16 or UCS4 for wchar_t (depending on whether wchar_t uses 16 or 32 bits).
The decoding conversions done by std::codecvt<...> take the incoming bytes in the external encoding and turn them into characters of the internal encoding. For example, when the external encoding is UTF8, the incoming bytes are converted to 32 bit code points, which are then stored as UTF16 by splitting them into two wchar_t values when necessary (e.g., when wchar_t is 16 bit and the code point doesn't fit into 16 bits).
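The UTF8 to UTF16 case can be sketched by hand as follows. This is not what any particular standard library does internally, just an illustration of the two steps. It uses char16_t as a stand-in for a 16 bit wchar_t so the result is the same on every platform, and it assumes well-formed input (no checks for overlong forms or stray continuation bytes). All function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Step 1: decode one UTF8 sequence starting at s[pos] into a code point
// and advance pos past it. Assumes well-formed input.
char32_t decode_utf8(const std::string& s, std::size_t& pos) {
    unsigned char lead = static_cast<unsigned char>(s[pos++]);
    char32_t cp = 0;
    int extra = 0;  // number of continuation bytes that follow
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else                            { cp = lead & 0x07; extra = 3; }
    for (; extra > 0; --extra) {
        assert(pos < s.size());
        // Each continuation byte contributes its low 6 bits.
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    }
    return cp;
}

// Step 2: store a code point as one or two UTF16 code units.
void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain, split into two halves of 10 bits
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
}

std::u16string utf8_to_utf16(const std::string& s) {
    std::u16string out;
    std::size_t pos = 0;
    while (pos < s.size()) {
        append_utf16(out, decode_utf8(s, pos));
    }
    return out;
}
```

For instance, the three bytes E2 82 AC (the euro sign, U+20AC) become the single unit 0x20AC, while the four bytes F0 9F 98 80 (U+1F600) become the surrogate pair 0xD83D 0xDE00.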
The details of this process are unspecified but it will involve some bit masking and shifting. Also, different transformations will use different approaches. If the mapping between the external and internal encoding isn't as trivial as mapping one Unicode representation to another representation there may be suitable tables providing the actual mapping.
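As an illustration of the table-driven case, here is a small sketch that decodes ISO-8859-15 into UCS4 code points. ISO-8859-15 is chosen only as an example of a single-byte encoding; it matches Latin-1 (where byte value equals code point) except in eight positions, so the table is the identity with eight overrides. The function names are made up, and a real facet may organize its tables quite differently.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Build the 256-entry lookup table: identity mapping (as in Latin-1),
// then the eight positions where ISO-8859-15 differs.
std::array<char32_t, 256> make_iso8859_15_table() {
    std::array<char32_t, 256> table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = static_cast<char32_t>(i);
    }
    table[0xA4] = 0x20AC;  // euro sign
    table[0xA6] = 0x0160;  // S with caron
    table[0xA8] = 0x0161;  // s with caron
    table[0xB4] = 0x017D;  // Z with caron
    table[0xB8] = 0x017E;  // z with caron
    table[0xBC] = 0x0152;  // ligature OE
    table[0xBD] = 0x0153;  // ligature oe
    table[0xBE] = 0x0178;  // Y with diaeresis
    return table;
}

// Decode by looking each byte up in the table; no shifting or masking needed.
std::u32string decode_iso8859_15(const std::string& bytes) {
    static const std::array<char32_t, 256> table = make_iso8859_15_table();
    std::u32string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(table[b]);
    }
    return out;
}
```

Because every byte maps to exactly one character, this kind of conversion needs no state between calls, unlike the multi-byte case.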