
I have a Win32 function which I need to port to iOS:

// Loads UTF-8 file and converts to a UTF-16 string

bool LoadUTF8File(char const *filename, wstring &str)
{
    size_t size;
    bool rc = false;
    void *bytes = LoadFile(filename, &size);
    if(bytes != 0)
    {
        int len = MultiByteToWideChar(CP_UTF8, 0, (LPCCH)bytes, size, 0, 0);
        if(len > 0)
        {
            str.resize(len + 1);
            MultiByteToWideChar(CP_UTF8, 0, (LPCCH)bytes, size, &str[0], len);
            str[len] = '\0';
            rc = true;
        }
        delete[] (char *)bytes;    // delete[] on a void * is undefined
    }
    return rc;
}

// LoadFile returns the loaded file as a block of memory
// There is a 3 byte BOM which MultiByteToWideChar seems to ignore
// The text in the file is encoded as UTF-8

I'm using C++ for this rather than Objective-C, and I've been trying `mbstowcs` and `_mbstowcs_l`. Neither behaves the same way as `MultiByteToWideChar`: for example, the accented character at the end of the word attaché is not converted correctly (the Win32 version converts it correctly). Is there a 'UTF-8 to UTF-16' function in the standard libraries somewhere?
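For what it's worth, C++11 did add a standard way to do this: `std::wstring_convert` with `std::codecvt_utf8_utf16` from `<codecvt>` (note that this facility was later deprecated in C++17). A minimal sketch, using `char16_t`/`std::u16string` rather than `wchar_t`, since `wchar_t` is 32 bits wide on iOS:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert a UTF-8 byte sequence to a UTF-16 string.
// std::wstring_convert throws std::range_error on invalid input by default.
std::u16string Utf8ToUtf16(std::string const &utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}
```

The function name `Utf8ToUtf16` is just for illustration; the conversion itself is done entirely by the standard library.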

Does the Win32 version have a bug in it which I'm not noticing?

The length returned from `MultiByteToWideChar` is less than the length returned from `mbstowcs`.

Weirdly, in this small test case

    char const *p = "attaché";

    wstring str;
    size_t size = strlen(p);
    setlocale(LC_ALL, "");
    int len = mbstowcs(nullptr, p, size);
    if(len > 0)
    {
        str.resize(len + 1);
        mbstowcs(&str[0], p, size);
        str[len] = '\0';
    }
    TRACE(L"%s\n", str.c_str());

    len = MultiByteToWideChar(CP_UTF8, 0, p, size, nullptr, 0);
    if(len > 0)
    {
        str.resize(len + 1);
        MultiByteToWideChar(CP_UTF8, 0, p, size, &str[0], len);
        str[len] = '\0';
    }
    TRACE(L"%s\n", str.c_str());

I get the correct output from mbstowcs, and MultiByteToWideChar erroneously converts the last character to 65533 (U+FFFD REPLACEMENT CHARACTER). Now I'm confused...
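One way to see what could be going on: if the compiler stored the source literal in the system codepage (e.g. Windows-1252) rather than UTF-8, the 'é' is the single byte E9, which is not valid UTF-8, and a strict UTF-8 decoder replaces it with U+FFFD. A hand-rolled decoder sketch (BMP-only, 1- to 3-byte sequences, no overlong/surrogate checks, purely for illustration) makes the difference visible:

```cpp
#include <cstdint>
#include <string>

// Decode UTF-8 into UTF-16 code units, emitting U+FFFD for invalid input.
// Sketch only: handles 1- to 3-byte sequences (BMP), skips stricter validation.
std::u16string DecodeUtf8(std::string const &src)
{
    std::u16string out;
    size_t i = 0;
    while(i < src.size())
    {
        unsigned char b = src[i];
        uint32_t cp;
        size_t extra;
        if(b < 0x80)                { cp = b;        extra = 0; }    // ASCII
        else if((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }    // 2-byte lead
        else if((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }    // 3-byte lead
        else                        { out += u'\xFFFD'; ++i; continue; }
        if(i + extra >= src.size()) { out += u'\xFFFD'; break; }     // truncated
        bool ok = true;
        for(size_t j = 1; j <= extra; ++j)
        {
            unsigned char c = src[i + j];
            if((c & 0xC0) != 0x80) { ok = false; break; }            // bad trail byte
            cp = (cp << 6) | (c & 0x3F);
        }
        out += ok ? (char16_t)cp : u'\xFFFD';
        i += ok ? extra + 1 : 1;
    }
    return out;
}
```

Feeding this the genuine UTF-8 bytes (`61 74 74 61 63 68 C3 A9`) yields the expected string, while a Windows-1252 trailing `E9` yields U+FFFD, which matches the behaviour described above.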

Charlie Skilbeck
  • Did you call `setlocale("");` before running `mbstowcs`? – Kerrek SB Oct 09 '12 at 16:23
  • Thanks for this - I wasn't, but this doesn't change the behavior I'm afraid. – Charlie Skilbeck Oct 09 '12 at 16:35
  • Maybe these two questions of mine are of some interest: [#1](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability), [#2](http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11). – Kerrek SB Oct 09 '12 at 16:37
  • See http://stackoverflow.com/questions/148403/utf8-to-from-wide-char-conversion-in-stl – Mark Ransom Oct 09 '12 at 16:43
  • For your test case, is it possible that the source literal isn't UTF-8? Do a binary dump of it. – Mark Ransom Oct 09 '12 at 16:56
  • I saved the file from Notepad++ and chose 'Encode in UTF-8' so I'm hoping it is. The actual bytes are: 61 74 74 61 63 68 c3 a9 (C3 A9 for the accented E) - is that right? – Charlie Skilbeck Oct 09 '12 at 16:58
  • @cskilbeck: Yes, `C3 A9` are the correct UTF-8 encoded bytes for the Unicode character "é" (U+00E9). But in general, I'd recommend avoiding putting non-ASCII characters in your source code to avoid potential encoding issues, and instead use escape sequences such as `"attach\xC3\xA9"` or `L"attach\u00E9"`. – Adam Rosenfield Oct 09 '12 at 17:23
  • In the end I found the source code for a simple UTF8 decoder on the web and it works cross platform, so problem solved. Thanks all for your help. http://lists.w3.org/Archives/Public/www-archive/2009Apr/0000.html – Charlie Skilbeck Oct 10 '12 at 06:30

1 Answer


Are you stuck with using C++ for this, or is that just what you've chosen so far and you'd be open to doing it in Objective-C too?

In Objective-C you can use `[yourUTF8String dataUsingEncoding:NSUTF16StringEncoding]` to get an `NSData` containing the bytes of the UTF-16 representation of the string.


Additional hypothesis: the "é" character that does not get correctly converted in your example may also be explained by your solution not handling the NFD form (or the NFC form, either one). That is, if the "é" is encoded in NFD form, as the character 'e' followed by a combining acute accent, it may not be interpreted correctly, whereas the NFC form (the precomposed accented character) will be. Or vice versa.

That's just one hypothesis; it really depends on what result you get instead of the expected "é", but it's worth checking.
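To make the difference concrete, the two normalization forms produce different UTF-8 byte sequences for the same visible text (byte values taken from the Unicode code charts):

```cpp
#include <string>

// NFC: the precomposed "é" (U+00E9), encoded in UTF-8 as C3 A9.
std::string const nfc = "attach\xC3\xA9";

// NFD: plain 'e' (U+0065) followed by the combining acute accent (U+0301),
// which encodes in UTF-8 as CC 81.
std::string const nfd = "attache\xCC\x81";
```

A byte-for-byte comparison of the two strings fails even though they render identically, which is why a converter that expects one form can stumble on the other.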

AliSoftware
  • 32,623
  • 6
  • 82
  • 77
  • Thanks, this may well end up being it. I'd like to avoid this route as there are other platforms down the line and a one size fits all would be nice. – Charlie Skilbeck Oct 09 '12 at 17:06