I'm trying to convert UTF-16 encoded strings to UCS-4.

If I understand correctly, C++11 provides this conversion through codecvt_utf16.

My code is something like:

#include <iostream>
#include <locale>
#include <memory>
#include <codecvt>
#include <string>

using namespace std;

int main()
{
    u16string s;

    s.push_back('h');
    s.push_back('e');
    s.push_back('l');
    s.push_back('l');
    s.push_back('o');

    wstring_convert<codecvt_utf16<wchar_t>, wchar_t> conv;
    wstring ws = conv.from_bytes(reinterpret_cast<const char*> (s.c_str()));

    wcout << ws << endl;

    return 0;
}

Note: the explicit push_backs are there to get around the fact that my version of clang (Xcode 4.2) doesn't have Unicode string literals.

When the code is run, I get a terminate exception. Am I doing something illegal here? I was thinking it should work, because the const char* that I passed to wstring_convert is UTF-16 encoded, right? I have also considered endianness being the issue, but I have checked that it's not the case.

ryaner
  • In principle, there should be some convenience functions in `<cuchar>` somewhere, but I've [had trouble](http://stackoverflow.com/questions/7562609/what-does-cuchar-provide-and-where-is-it-documented) finding out how those work. – Kerrek SB Dec 16 '11 at 21:05
  • @Kerrek SB `<cuchar>` is locale-dependent; the question is about Unicode-to-Unicode conversion, no locales involved. – Cubbi Dec 16 '11 at 23:14
  • @Cubbi: Hm, pretty sure that `<cuchar>` has nothing to do with locales, but I might be wrong... – Kerrek SB Dec 16 '11 at 23:22
  • @Kerrek SB: On second thought, you're right, one could use c16rtomb followed by mbrtoc32, abstracting the locale-specific multibyte encoding away. – Cubbi Dec 17 '11 at 00:14
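
For reference, a rough sketch of the `<cuchar>` route described in the last comment: encode each char16_t into the locale's multibyte encoding with c16rtomb, then decode those bytes into char32_t with mbrtoc32. The helper name is made up, and the sketch assumes both that the toolchain ships `<cuchar>` (many 2011-era compilers didn't) and that the active locale's multibyte encoding is UTF-8, so no character is lost in the middle:

#include <cuchar>
#include <climits>
#include <clocale>
#include <string>

using namespace std;

// hypothetical helper: UTF-16 -> locale multibyte -> UCS-4
u32string through_cuchar(const u16string& in)
{
    setlocale(LC_ALL, "");  // assumption: the environment locale is UTF-8

    // pass 1: char16_t -> multibyte bytes
    string mb;
    mbstate_t st16{};
    char buf[MB_LEN_MAX];
    for (char16_t c : in) {
        // c16rtomb returns 0 for a leading surrogate (it only stores
        // state) and (size_t)-1 on error
        size_t n = c16rtomb(buf, c, &st16);
        if (n == (size_t)-1)
            break;
        mb.append(buf, n);
    }

    // pass 2: multibyte bytes -> char32_t
    u32string out;
    mbstate_t st32{};
    const char* p = mb.data();
    const char* end = p + mb.size();
    while (p < end) {
        char32_t c;
        size_t n = mbrtoc32(&c, p, end - p, &st32);
        if (n == (size_t)-1 || n == (size_t)-2)
            break;  // invalid or incomplete sequence
        out.push_back(c);
        p += n ? n : 1;  // n == 0 means a null character was decoded
    }
    return out;
}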

1 Answer


Two errors:

1) The from_bytes() overload that takes a single const char* expects a null-terminated byte string, but the very second byte of your string is '\0'.

2) Your system is likely little-endian, so you need to convert from UTF-16LE to UCS-4:

#include <iostream>
#include <locale>
#include <memory>
#include <codecvt>
#include <string>

using namespace std;

int main()
{
    u16string s;

    s.push_back('h');
    s.push_back('e');
    s.push_back('l');
    s.push_back('l');
    s.push_back('o');

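    // Maxcode (0x10ffff) has to be spelled out only because the
    // endianness flag, little_endian, is the third template parameter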
    wstring_convert<codecvt_utf16<wchar_t, 0x10ffff, little_endian>,
                     wchar_t> conv;
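    // the two-pointer overload of from_bytes() is needed here: the
    // single const char* overload stops at the first '\0' byte, and in
    // UTF-16LE the second byte of "hello" is already '\0'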
    wstring ws = conv.from_bytes(
                     reinterpret_cast<const char*> (&s[0]),
                     reinterpret_cast<const char*> (&s[0] + s.size()));

    wcout << ws << endl;

    return 0;
}

Tested with Visual Studio 2010 SP1 on Windows and Clang++/libc++-svn on Linux.

Cubbi
  • PS, this should be using char32_t to guarantee UCS-4, of course; the wchar_t version produces UTF-16 where wchar_t is 16 bits (a char32_t variant is sketched after these comments). – Cubbi Dec 16 '11 at 23:34
  • This is a very awesome answer, and I salute you for knowing all this! I'd upvote the answer 3 more times if I could. May I also ask a couple more questions: 1. Can you explain the concept of MaxCode, which you set to 0x10ffff? Because I notice that it's actually needed. 2. Good point about '\0' being the terminator of const char*. It makes me wonder: what would be the corresponding terminator for char16_t*? Thanks again. – ryaner Dec 16 '11 at 23:38
  • @ryaner `Maxcode` is just the limit on the acceptable character values; it's only needed here because the endianness/BOM-handling indicator happens to be the third template parameter, which I think is a small design flaw. The terminating character for a null-terminated array of `char16_t` is `char16_t()`, aka `u'\0'`. – Cubbi Dec 17 '11 at 13:47
  • It does not work, for example, in the case of std::u16string s = u"\U00010330"; – rnd_nr_gen Dec 19 '16 at 19:46
  • @elgcom works for me: http://coliru.stacked-crooked.com/a/c8b1bc6d6b6b7c9b – Cubbi Dec 19 '16 at 19:59
  • @Cubbi on clang yes, but on VC++ it failed: http://rextester.com/CWIGH25202 – rnd_nr_gen Dec 19 '16 at 20:14
  • @elgcom On Windows, wstring is not UCS-4, it's UCS-2 (or, in some APIs, UTF-16), and your character can't be represented in UCS-2. – Cubbi Dec 19 '16 at 20:41
  • @Cubbi hmm, that's not really clear to me. Basically wstring can store such a Unicode string (e.g. wstring ws = L"\U00010330"), and alternatively I can convert such a u16string first to a UTF-8 std::string and then to std::wstring without problems. Why can't codecvt_utf16 be smart enough? – rnd_nr_gen Dec 19 '16 at 21:02
  • @elgcom it's as smart as it can be. All of the C++ (and C) standard library assumes wchar_t is 32 bits (informally speaking). The fact that it's permanently broken on Windows is pretty much the reason we got char32_t in C11 and C++11. – Cubbi Dec 19 '16 at 21:26