Does the C++ Standard Template Library (STL) provide any method to convert a UTF8 encoded byte buffer into a wstring?

For example:

const unsigned char* szBuf = (const unsigned char*) "d\xC3\xA9j\xC3\xA0 vu";
std::wstring str = method(szBuf); // Should assign "déjà vu" to str

I want to avoid having to implement my own UTF8 conversion code, like this:

const unsigned char* pch = szBuf;    
while (*pch != 0)
{
    if ((*pch & 0x80) == 0)
    {
        str += *pch++;
    }
    else if ((*pch & 0xE0) == 0xC0 && (pch[1] & 0xC0) == 0x80)
    {
        wchar_t ch = (((*pch & 0x1F) >> 2) << 8) +
            ((*pch & 0x03) << 6) +
            (pch[1] & 0x3F);
        str += ch;
        pch += 2;
    }
    else if (...)
    {
        // other cases omitted
    }
}

EDIT: Thanks for your comments and the answer. This code fragment performs the desired conversion:

// Requires <locale> and <codecvt> (C++11; deprecated since C++17).
std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
str = convert.from_bytes(reinterpret_cast<const char*>(szBuf));
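
For reference, here is a minimal sketch of that fragment wrapped in a function, with the headers it needs. The name `utf8_to_wstring` is made up for illustration and is not part of the standard library. Two caveats: `from_bytes` throws `std::range_error` on invalid input when the converter is constructed without an error string, and `std::codecvt_utf8<wchar_t>` produces UCS-2 where `wchar_t` is 16 bits wide (e.g. on Windows), so characters outside the BMP would need `std::codecvt_utf8_utf16` there.

```cpp
#include <cassert>
#include <codecvt>  // std::codecvt_utf8 (C++11, deprecated since C++17)
#include <locale>   // std::wstring_convert
#include <string>

// Hypothetical helper: decodes a NUL-terminated UTF-8 buffer into a wstring.
// Throws std::range_error if the bytes are not valid UTF-8.
std::wstring utf8_to_wstring(const unsigned char* utf8)
{
    assert(utf8 != nullptr);  // precondition: caller passes a valid buffer
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    return convert.from_bytes(reinterpret_cast<const char*>(utf8));
}
```

With the buffer from the question, `utf8_to_wstring(szBuf)` should yield the seven-character string L"déjà vu".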
  • 2
    For C++11, yes! See the answer here: http://stackoverflow.com/a/7235204/209199 – Cory Nelson Jul 21 '13 at 16:48
  • 1
    With the new `codecvt` facilities, you can convert the UTF-8 to UTF-32, the wide to narrow, and the narrow to UTF-32, and then compare the two UTF-32 sequences. – Kerrek SB Jul 21 '13 at 16:55
  • If C++11 is not an option, see the utf-cpp library: http://stackoverflow.com/a/14603997/1497972 – Max Jul 09 '15 at 10:33

1 Answer

In C++11 you can use std::codecvt_utf8. If you don't have that, you may be able to persuade iconv to do what you want. Unfortunately, iconv is not ubiquitous either, not every implementation that has it supports UTF-8, and I'm not aware of a portable way to find out the appropriate encoding name to pass to iconv_open for a conversion to or from wchar_t.
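
As a rough sketch of the iconv route: the code below assumes an implementation that accepts the encoding name "WCHAR_T", which glibc and GNU libiconv do, but this is not guaranteed by POSIX and may not hold elsewhere. The function name `utf8_to_wstring_iconv` is invented for the example.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: UTF-8 -> wchar_t via iconv.
// ASSUMPTION: the platform's iconv accepts "WCHAR_T" as an encoding name
// (true for glibc and GNU libiconv; not portable).
std::wstring utf8_to_wstring_iconv(const std::string& utf8)
{
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");  // (tocode, fromcode)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Each input byte yields at most one wchar_t, so this is large enough.
    std::wstring out(utf8.size(), L'\0');

    char* in = const_cast<char*>(utf8.data());
    size_t in_left = utf8.size();
    char* dst = reinterpret_cast<char*>(&out[0]);
    size_t out_left = out.size() * sizeof(wchar_t);

    size_t rc = iconv(cd, &in, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed (invalid or truncated UTF-8?)");

    // Trim the unused tail of the output buffer.
    out.resize(out.size() - out_left / sizeof(wchar_t));
    return out;
}
```

On systems where iconv lives in a separate library, this also needs `-liconv` at link time.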

If you don't have either of those things, your best bet is a third-party library such as ICU. Surprisingly, Boost does not appear to have anything for this purpose, although I could have missed it.

zwol