I have a Win32 function which I need to port to iOS:
// Loads UTF-8 file and converts to a UTF-16 string
bool LoadUTF8File(char const *filename, wstring &str)
{
    size_t size;
    bool rc = false;
    void *bytes = LoadFile(filename, &size);
    if(bytes != 0)
    {
        int len = MultiByteToWideChar(CP_UTF8, 0, (LPCCH)bytes, size, 0, 0);
        if(len > 0)
        {
            str.resize(len + 1);
            MultiByteToWideChar(CP_UTF8, 0, (LPCCH)bytes, size, &str[0], len);
            str[len] = '\0';
            rc = true;
        }
        delete[] bytes;
    }
    return rc;
}
// LoadFile returns the loaded file as a block of memory
// There is a 3-byte BOM which MultiByteToWideChar seems to ignore
// The text in the file is encoded as UTF-8
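In case it's relevant, LoadFile is essentially just a "read the whole file into a new[]'d buffer" helper; a simplified equivalent (minus the real error handling) would be:

// Reads an entire file into a buffer allocated with new[]
// Returns 0 on failure; on success *size receives the byte count
// (needs <cstdio>)
void *LoadFile(char const *filename, size_t *size)
{
    FILE *f = fopen(filename, "rb");
    if(f == 0)
        return 0;
    fseek(f, 0, SEEK_END);
    long length = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *buffer = new char[length];
    size_t got = fread(buffer, 1, (size_t)length, f);
    fclose(f);
    if(got != (size_t)length)
    {
        delete[] buffer;
        return 0;
    }
    *size = got;
    return buffer;
}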
I'm using C++ for this rather than Objective-C, and I've been trying to use mbstowcs and _mbstowcs_l. They don't seem to behave the same way as MultiByteToWideChar: for example, the accented character at the end of the word "attaché" is not converted correctly (the Win32 version converts it correctly). Is there a 'UTF-8 to UTF-16' function in the standard libraries somewhere?
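For context, my iOS attempt is essentially the same function with mbstowcs in place of MultiByteToWideChar; a simplified version of what I've been trying looks like this (the copy into a std::string is just to guarantee a null-terminated source for mbstowcs):

// Same idea as LoadUTF8File, but using mbstowcs instead of MultiByteToWideChar
// (needs <cstdlib>, <clocale> and <string>)
bool LoadUTF8File(char const *filename, wstring &str)
{
    size_t size;
    bool rc = false;
    void *bytes = LoadFile(filename, &size);
    if(bytes != 0)
    {
        setlocale(LC_ALL, "");                       // hoping the active locale treats the bytes as UTF-8
        string utf8((char const *)bytes, size);      // mbstowcs needs a null-terminated source
        delete[] bytes;
        size_t len = mbstowcs(0, utf8.c_str(), 0);   // first pass: count the wide characters
        if(len != (size_t)-1 && len > 0)
        {
            str.resize(len + 1);
            mbstowcs(&str[0], utf8.c_str(), len);    // second pass: do the conversion
            str[len] = L'\0';
            rc = true;
        }
    }
    return rc;
}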
Does the Win32 version have a bug in it which I'm not noticing?
The length returned from MultiByteToWideChar is less than the length returned from mbstowcs.
Weirdly, in this small test case:
char const *p = "attaché";
wstring str;
size_t size = strlen(p);

setlocale(LC_ALL, "");
int len = mbstowcs(nullptr, p, size);
if(len > 0)
{
    str.resize(len + 1);
    mbstowcs(&str[0], p, size);
    str[len] = '\0';
}
TRACE(L"%s\n", str.c_str());

len = MultiByteToWideChar(CP_UTF8, 0, p, size, nullptr, 0);
if(len > 0)
{
    str.resize(len + 1);
    MultiByteToWideChar(CP_UTF8, 0, p, size, &str[0], len);
    str[len] = '\0';
}
TRACE(L"%s\n", str.c_str());
I get the correct output from mbstowcs, while MultiByteToWideChar erroneously converts the last character to 65533 (U+FFFD, the REPLACEMENT CHARACTER). Now I'm confused...