A Unicode codepoint uses 2 or 4 bytes in UTF-16, but 1-4 bytes in UTF-8, depending on its value. It is possible for a codepoint that uses 2 bytes in UTF-16 to use 2 or 3 bytes in UTF-8, so a UTF-8 string may use more bytes than the corresponding UTF-16 string. UTF-8 tends to be more compact for Latin/Western languages, while UTF-16 tends to be more compact for East Asian languages.
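For reference, a minimal sketch (not part of your conversion code) of how many UTF-8 bytes a single codepoint needs, based on its value:

int Utf8BytesForCodepoint(char32_t cp)
{
    if (cp <= 0x7F)   return 1; // ASCII
    if (cp <= 0x7FF)  return 2; // e.g. Latin letters with diacritics, like U+00F6
    if (cp <= 0xFFFF) return 3; // rest of the BMP, including most East Asian characters
    return 4;                   // supplementary planes (these also need 2 UTF-16 code units)
}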
std::(w)string::size() and CStringT::GetLength() count the number of encoded code units, not the number of codepoints. In your example, "gökhan" is encoded as:
UTF-16LE: 0x0067 0x00f6 0x006b 0x0068 0x0061 0x006e
UTF-16BE: 0x6700 0xf600 0x6b00 0x6800 0x6100 0x6e00
UTF-8: 0x67 0xc3 0xb6 0x6b 0x68 0x61 0x6e
Notice that ö is encoded using 1 code unit in UTF-16 (LE: 0x00f6, BE: 0xf600) but uses 2 code units in UTF-8 (0xc3 0xb6). That is why your UTF-8 string has a size of 7 instead of 6.
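You can see the same counts directly from the string sizes; a quick sketch, assuming a Windows compiler where wchar_t strings are UTF-16:

#include <string>

std::wstring utf16 = L"g\u00F6khan";  // 6 UTF-16 code units
std::string  utf8  = "g\xC3\xB6khan"; // 7 UTF-8 code units ("\xC3\xB6" is the UTF-8 encoding of U+00F6)

// utf16.size() == 6
// utf8.size()  == 7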
That being said, when calling WideCharToMultiByte() and MultiByteToWideChar() with -1 as the source length, the function has to manually count the characters, and the return value will include room for a null terminator when the destination pointer is NULL. You don't need that extra space when using CStringA/W, std::(w)string, etc., and you don't need the overhead of counting characters when the source already knows its length. You should always specify the actual source length when you know it, e.g.:
CStringA ConvertUnicodeToUTF8(const CStringW& uni)
{
    CStringA utf8;
    // First call: NULL destination asks how many UTF-8 bytes are needed.
    // Passing the actual length instead of -1 avoids re-counting and a null terminator slot.
    int cc = WideCharToMultiByte(CP_UTF8, 0, uni, uni.GetLength(), NULL, 0, 0, 0);
    if (cc > 0)
    {
        char *buf = utf8.GetBuffer(cc);
        if (buf)
        {
            // Second call: perform the actual conversion into the CStringA buffer.
            cc = WideCharToMultiByte(CP_UTF8, 0, uni, uni.GetLength(), buf, cc, 0, 0);
            utf8.ReleaseBuffer(cc);
        }
    }
    return utf8;
}
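If you ever need the reverse direction, MultiByteToWideChar() can be used the same way. A sketch along the same lines (ConvertUTF8ToUnicode is just an illustrative name, not something in your code):

CStringW ConvertUTF8ToUnicode(const CStringA& utf8)
{
    CStringW uni;
    // Query the required number of UTF-16 code units, again passing the real length.
    int cc = MultiByteToWideChar(CP_UTF8, 0, utf8, utf8.GetLength(), NULL, 0);
    if (cc > 0)
    {
        WCHAR *buf = uni.GetBuffer(cc);
        if (buf)
        {
            cc = MultiByteToWideChar(CP_UTF8, 0, utf8, utf8.GetLength(), buf, cc);
            uni.ReleaseBuffer(cc);
        }
    }
    return uni;
}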