UTF-16 to UTF8 with WideCharToMultiByte problems

Question

int main(){
//"Chào" in Vietnamese
wchar_t utf16[] =L"\x00ff\x00fe\x0043\x0000\x0068\x0000\x00EO\x0000\x006F";
//Dump utf16: FF FE 43 0 68 0 E 4F 0 6F (right)
int size = WideCharToMultiByte(CP_UTF8,0,utf16,-1,NULL,0,NULL,NULL);
char *utf8 = new char[size];
int k = WideCharToMultiByte(CP_UTF8,0,utf16,-1,utf8 ,size,NULL,NULL);
//Dump utf8: ffffffc3 fffffbf ffffc3 ffffbe 43 0
}

Here is my code. When I convert the string to UTF-8, it shows the wrong result. What is wrong with my code?

For starters, you probably want to convert your whole array, even though it is not a wide character string: It has embedded zeroes. — Deduplicator, Apr 11 '14 at 14:59
So can you suggest a solution for this ?, how to make a properly utf-16 string in C++ — user2477, Apr 11 '14 at 15:07
@MarkRansom: No reason to add to the confusion. Also, the OP will surely stumble across UTF-32 soon, as he already did for UTF-8. — Deduplicator, Apr 11 '14 at 15:09
why somebody remove the answer, it really help for me. There are nothing wrong about the answer. BTW, thanks for helping — user2477, Apr 11 '14 at 15:36

score 0 · Accepted Answer · edited May 23 '17 at 12:21

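#include <cstdio>

int main() {
    // U+FEFF (byte order mark) followed by "Chào"
    wchar_t utf16[] = L"\uFEFFChào";
    int size = 5; // five code units, not counting the terminating null

    for (int i = 0; i < size; ++i) {
        // cast so the argument matches the unsigned int that %X expects
        std::printf("%X ", static_cast<unsigned int>(utf16[i]));
    }
}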

This program prints out: FEFF 43 68 E0 6F

If printing out each wchar_t you've read from a file prints out FF FE 43 0 68 0 E 4F 0 6F then the UTF-16 data is not being read from the file correctly. Those values represent the UTF-16 string L"ÿþC\0h\0à\0o".

You don't show your code for reading from the file, but here's one way to do it correctly:

https://stackoverflow.com/a/10504278/365496
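
Independent of the linked answer, the core of reading such a file correctly is that every two bytes form one 16-bit code unit. Here is a minimal sketch of that step, assuming the file is little-endian (which is what the FF FE byte order mark indicates). The helper name utf16le_bytes_to_units and the sample byte array are made up for illustration and are not part of any library.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: combines byte pairs from a UTF-16LE file into
// 16-bit code units (low byte first). A trailing odd byte is ignored.
std::vector<std::uint16_t> utf16le_bytes_to_units(const std::vector<unsigned char>& bytes) {
    std::vector<std::uint16_t> units;
    units.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        units.push_back(static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return units;
}

// Assumed raw bytes of a little-endian UTF-16 file holding a BOM and "Chào".
const std::vector<unsigned char> file_bytes = {
    0xFF, 0xFE, 0x43, 0x00, 0x68, 0x00, 0xE0, 0x00, 0x6F, 0x00};
```

Applied to file_bytes, this yields the five code units FEFF 43 68 E0 6F, the same values the program above prints. Storing each byte in its own wchar_t instead gives the broken FF FE 43 0 ... pattern from the question.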

score 0 · Answer 2 · answered Apr 11 '14 at 15:41

You're reading the file incorrectly. Your dump of the input is showing single bytes in wide characters. Your dump of the output is the byte sequence that results from encoding L"\xff\xfe\x43" to UTF-8. The string is being truncated at the first \x0000 in the input.
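
The truncation and the resulting bytes can be reproduced without the Windows API. The following is only a sketch, not WideCharToMultiByte itself. encode_utf8_until_null is a hypothetical helper that stops at the first zero code unit, as a conversion of a null-terminated string does. It assumes the input contains no surrogate pairs.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encodes 16-bit code units as UTF-8 and stops at the
// first zero unit, the way a null-terminated conversion would.
// Assumption: the input contains no surrogate pairs (BMP only).
std::string encode_utf8_until_null(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::uint16_t u : units) {
        if (u == 0) break;                 // terminator: the rest is ignored
        assert(u < 0xD800 || u > 0xDFFF);  // surrogates are not handled here
        if (u < 0x80) {
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}

// Code units as dumped in the question: one file byte per wide character.
const std::vector<std::uint16_t> misread = {0xFF, 0xFE, 0x43, 0x00, 0x68, 0x00};

// Code units of the correctly read text: BOM followed by "Chào".
const std::vector<std::uint16_t> correct = {0xFEFF, 0x43, 0x68, 0xE0, 0x6F, 0x00};
```

For misread the result is the bytes C3 BF C3 BE 43, which matches the output dump in the question once the sign extension is ignored. For correct the result is EF BB BF 43 68 C3 A0 6F, that is, the UTF-8 BOM followed by "Chào".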
