6

Consider:

STDMETHODIMP CFileSystemAPI::setRRConfig( BSTR config_str, VARIANT* ret )
{
    mReportReaderFactory.reset( new sbis::report_reader::ReportReaderFactory() );

    USES_CONVERSION;
    std::string configuration_str = W2A( config_str );

But in config_str I get a string in UTF-16. How can I convert it to UTF-8 in this piece of code?

user3252635

4 Answers

5

You can do something like this:

#include <windows.h>  // WideCharToMultiByte, CP_UTF8
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

std::string WstrToUtf8Str(const std::wstring& wstr)
{
  std::string retStr;
  if (!wstr.empty())
  {
    // First call: ask for the required buffer size (includes the terminating null).
    int sizeRequired = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, NULL, 0, NULL, NULL);

    if (sizeRequired > 0)
    {
      std::vector<char> utf8String(sizeRequired);
      int bytesConverted = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1,
                                               &utf8String[0],
                                               (int)utf8String.size(),
                                               NULL, NULL);
      if (bytesConverted != 0)
      {
        retStr = &utf8String[0];
      }
      else
      {
        // Note: don't stream the wide string into a narrow stream;
        // that would only print a pointer value.
        std::stringstream err;
        err << __FUNCTION__
            << " failed to convert the wstring to UTF-8";
        throw std::runtime_error( err.str() );
      }
    }
  }
  return retStr;
}

You can pass your BSTR to the function as a std::wstring.
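
For example, at the call site from the question it could look like this (a sketch based on the question's code; SysStringLen gives the BSTR's length even when it contains embedded nulls):

STDMETHODIMP CFileSystemAPI::setRRConfig( BSTR config_str, VARIANT* ret )
{
    // A BSTR points at UTF-16 (wchar_t) data, so it can initialize a std::wstring.
    std::wstring wide_str( config_str, SysStringLen( config_str ) );
    std::string configuration_str = WstrToUtf8Str( wide_str );
    // ...
}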

AndersK
5

I implemented two variants of conversion between UTF-8 <-> UTF-16 <-> UTF-32. The first variant implements all conversions from scratch; the second uses the standard std::codecvt and std::wstring_convert facilities (these two classes are deprecated starting from C++17, but still exist, and are guaranteed to exist in C++11/C++14).

If you don't like my code, you may use the almost-single-header C++ library utfcpp, which is very well tested and widely used.

To convert UTF-8 to UTF-16, just call Utf32To16(Utf8To32(str)); to convert UTF-16 to UTF-8, call Utf32To8(Utf16To32(str)). Or you may use my handy helper function: UtfConv<std::wstring>(std::string("abc")) for UTF-8 to UTF-16, or UtfConv<std::string>(std::wstring(L"abc")) for UTF-16 to UTF-8. UtfConv can actually convert from any UTF-encoded string to any other. See examples of these and other usages inside the Test(cs) macro, and the condensed sketch below.
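
For example, a condensed usage sketch (assuming the functions from either variant below are in scope):

std::u16string u16 = Utf32To16(Utf8To32(std::string("abc")));   // UTF-8  -> UTF-16
std::string    u8  = Utf32To8(Utf16To32(u16));                  // UTF-16 -> UTF-8
std::wstring   w   = UtfConv<std::wstring>(std::string("abc")); // UTF-8  -> wide (UTF-16/32)
std::string    u8b = UtfConv<std::string>(w);                   // wide   -> UTF-8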

Both variants are C++11 compliant. They compile with Clang/GCC/MSVC (see the "Try it online!" links down below) and are tested to work on Windows and Linux.

You have to save both of my code snippets in files with UTF-8 encoding and provide the options -finput-charset=UTF-8 -fexec-charset=UTF-8 to Clang/GCC, or the option /utf-8 to MSVC. This UTF-8 saving and these options are needed only if you put string literals with non-ASCII characters into the source, as I did for testing purposes. To use the conversion functions themselves you don't need any of this.

The inclusion of <windows.h>, <clocale>, and <iostream>, as well as the calls to SetConsoleOutputCP(65001) and std::setlocale(LC_ALL, "en_US.UTF-8"), are needed only for the tests, to set up the console for correct UTF-8 output. They are not needed by the conversion functions.

Part of the code is not strictly necessary: the UtfHelper-related structure and functions are just conversion helpers, created mainly to handle std::wstring in a cross-platform way, because wchar_t is usually 32-bit on Linux and 16-bit on Windows. Only the low-level functions Utf8To32, Utf32To8, Utf16To32, and Utf32To16 are really needed for conversion.

Variant 1 was created from the Wikipedia descriptions of the UTF-8 and UTF-16 encodings. For example, U+041F ('П') needs 11 bits, so UTF-8 encodes it in two bytes: 0xD0 0x9F.

If you find bugs or any possible improvements (especially in Variant 1), please tell me and I'll fix them.


Variant 1

Try it online!

#include <string>
#include <iostream>
#include <stdexcept>
#include <type_traits>
#include <cstdint>

#ifdef _WIN32
    #include <windows.h>
#else
    #include <clocale>
#endif

#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: " + std::string(msg)); }
#define ASSERT(cond) ASSERT_MSG(cond, "")

template <typename U8StrT = std::string>
inline static U8StrT Utf32To8(std::u32string const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    typedef typename U8StrT::value_type VT;
    typedef uint8_t u8;
    U8StrT r;
    for (auto c: s) {
        size_t nby = c <= 0x7FU ? 1 : c <= 0x7FFU ? 2 : c <= 0xFFFFU ? 3 : c <= 0x1FFFFFU ? 4 : c <= 0x3FFFFFFU ? 5 : c <= 0x7FFFFFFFU ? 6 : 7;
        r.push_back(VT(
            nby <= 1 ? u8(c) : (
                (u8(0xFFU) << (8 - nby)) |
                u8(c >> (6 * (nby - 1)))
            )
        ));
        for (size_t i = 1; i < nby; ++i)
            r.push_back(VT(u8(0x80U | (u8(0x3FU) & u8(c >> (6 * (nby - 1 - i)))))));
    }
    return r;
}

template <typename U8StrT>
inline static std::u32string Utf8To32(U8StrT const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    typedef uint8_t u8;
    std::u32string r;
    auto it = (u8 const *)s.c_str(), end = (u8 const *)(s.c_str() + s.length());
    while (it < end) {
        char32_t c = 0;
        if (*it <= 0x7FU) {
            c = *it;
            ++it;
        } else {
            ASSERT((*it & 0xC0U) == 0xC0U);
            size_t nby = 0;
            for (u8 b = *it; (b & 0x80U) != 0; b <<= 1, ++nby) {(void)0;}
            ASSERT(nby <= 7);
            ASSERT(size_t(end - it) >= nby);
            c = *it & (u8(0xFFU) >> (nby + 1));
            for (size_t i = 1; i < nby; ++i) {
                ASSERT((it[i] & 0xC0U) == 0x80U);
                c = (c << 6) | (it[i] & 0x3FU);
            }
            it += nby;
        }
        r.push_back(c);
    }
    return r;
}


template <typename U16StrT = std::u16string>
inline static U16StrT Utf32To16(std::u32string const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    typedef typename U16StrT::value_type VT;
    typedef uint16_t u16;
    U16StrT r;
    for (auto c: s) {
        if (c <= 0xFFFFU)
            r.push_back(VT(c));
        else {
            ASSERT(c <= 0x10FFFFU);
            c -= 0x10000U;
            r.push_back(VT(u16(0xD800U | ((c >> 10) & 0x3FFU))));
            r.push_back(VT(u16(0xDC00U | (c & 0x3FFU))));
        }
    }
    return r;
}

template <typename U16StrT>
inline static std::u32string Utf16To32(U16StrT const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    typedef uint16_t u16;
    std::u32string r;
    auto it = (u16 const *)s.c_str(), end = (u16 const *)(s.c_str() + s.length());
    while (it < end) {
        char32_t c = 0;
        if (*it < 0xD800U || *it > 0xDFFFU) {
            c = *it;
            ++it;
        } else if (*it >= 0xDC00U) {
            ASSERT_MSG(false, "Unallowed UTF-16 sequence!");
        } else {
            ASSERT(end - it >= 2);
            c = (*it & 0x3FFU) << 10;
            if ((it[1] < 0xDC00U) || (it[1] > 0xDFFFU)) {
                ASSERT_MSG(false, "Unallowed UTF-16 sequence!");
            } else {
                c |= it[1] & 0x3FFU;
                c += 0x10000U;
            }
            it += 2;
        }
        r.push_back(c);
    }
    return r;
}


template <typename StrT, size_t NumBytes = sizeof(typename StrT::value_type)> struct UtfHelper;
template <typename StrT> struct UtfHelper<StrT, 1> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf8To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To8<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 2> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf16To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To16<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 4> {
    inline static std::u32string UtfTo32(StrT const & s) {
        return std::u32string((char32_t const *)(s.c_str()), (char32_t const *)(s.c_str() + s.length()));
    }
    inline static StrT UtfFrom32(std::u32string const & s) {
        return StrT((typename StrT::value_type const *)(s.c_str()),
            (typename StrT::value_type const *)(s.c_str() + s.length()));
    }
};
template <typename StrT> inline static std::u32string UtfTo32(StrT const & s) {
    return UtfHelper<StrT>::UtfTo32(s);
}
template <typename StrT> inline static StrT UtfFrom32(std::u32string const & s) {
    return UtfHelper<StrT>::UtfFrom32(s);
}
template <typename StrToT, typename StrFromT> inline static StrToT UtfConv(StrFromT const & s) {
    return UtfFrom32<StrToT>(UtfTo32(s));
}

#define Test(cs) \
    std::cout << Utf32To8(Utf8To32(std::string(cs))) << ", "; \
    std::cout << Utf32To8(Utf16To32(Utf32To16(Utf8To32(std::string(cs))))) << ", "; \
    std::cout << Utf32To8(Utf16To32(std::u16string(u##cs))) << ", "; \
    std::cout << Utf32To8(std::u32string(U##cs)) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::u16string>(UtfConv<std::u32string>(UtfConv<std::u32string>(UtfConv<std::u16string>(std::string(cs)))))) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::wstring>(UtfConv<std::string>(UtfConv<std::u32string>(UtfConv<std::u32string>(std::string(cs)))))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::string(cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u16string(u##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::wstring(L##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u32string(U##cs))) << std::endl; \
    std::cout << "UTF-8 num bytes: " << std::dec << Utf32To8(std::u32string(U##cs)).size() << ", "; \
    std::cout << "UTF-16 num bytes: " << std::dec << (Utf32To16(std::u32string(U##cs)).size() * 2) << std::endl;

int main() {
    #ifdef _WIN32
        SetConsoleOutputCP(65001);
    #else
        std::setlocale(LC_ALL, "en_US.UTF-8");
    #endif
    try {
        Test("World");
        Test("Привет");
        Test("");
        Test("");
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

World, World, World, World, World, World, World, World, World, World
UTF-8 num bytes: 5, UTF-16 num bytes: 10
Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет
UTF-8 num bytes: 12, UTF-16 num bytes: 12
, , , , , , , , , 
UTF-8 num bytes: 8, UTF-16 num bytes: 8
, , , , , , , , , 
UTF-8 num bytes: 4, UTF-16 num bytes: 4

Variant 2

Try it online!

#include <string>
#include <iostream>
#include <stdexcept>
#include <type_traits>
#include <locale>
#include <codecvt>
#include <cstdint>

#ifdef _WIN32
    #include <windows.h>
#else
    #include <clocale>
#endif

#define ASSERT(cond) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "!"); }

// Workaround for some MSVC compiler versions.
#if defined(_MSC_VER) && (!_DLL) && (_MSC_VER >= 1900 /* VS 2015*/) && (_MSC_VER <= 1914 /* VS 2017 */)
std::locale::id std::codecvt<char16_t, char, _Mbstatet>::id;
std::locale::id std::codecvt<char32_t, char, _Mbstatet>::id;
#endif

template <typename U8StrT>
inline static std::u32string Utf8To32(U8StrT const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv_;
    return utf_8_32_conv_.from_bytes((char const *)s.c_str(), (char const *)(s.c_str() + s.length()));
}

template <typename U8StrT = std::string>
inline static U8StrT Utf32To8(std::u32string const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv_;
    std::string res = utf_8_32_conv_.to_bytes(s.c_str(), s.c_str() + s.length());
    return U8StrT(
        (typename U8StrT::value_type const *)(res.c_str()),
        (typename U8StrT::value_type const *)(res.c_str() + res.length()));
}

template <typename U16StrT>
inline static std::u32string Utf16To32(U16StrT const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> utf_16_32_conv_;
    return utf_16_32_conv_.from_bytes((char const *)s.c_str(), (char const *)(s.c_str() + s.length()));
}

template <typename U16StrT = std::u16string>
inline static U16StrT Utf32To16(std::u32string const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> utf_16_32_conv_;
    std::string res = utf_16_32_conv_.to_bytes(s.c_str(), s.c_str() + s.length());
    return U16StrT(
        (typename U16StrT::value_type const *)(res.c_str()),
        (typename U16StrT::value_type const *)(res.c_str() + res.length()));
}


template <typename StrT, size_t NumBytes = sizeof(typename StrT::value_type)> struct UtfHelper;
template <typename StrT> struct UtfHelper<StrT, 1> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf8To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To8<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 2> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf16To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To16<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 4> {
    inline static std::u32string UtfTo32(StrT const & s) {
        return std::u32string((char32_t const *)(s.c_str()), (char32_t const *)(s.c_str() + s.length()));
    }
    inline static StrT UtfFrom32(std::u32string const & s) {
        return StrT((typename StrT::value_type const *)(s.c_str()),
            (typename StrT::value_type const *)(s.c_str() + s.length()));
    }
};
template <typename StrT> inline static std::u32string UtfTo32(StrT const & s) {
    return UtfHelper<StrT>::UtfTo32(s);
}
template <typename StrT> inline static StrT UtfFrom32(std::u32string const & s) {
    return UtfHelper<StrT>::UtfFrom32(s);
}
template <typename StrToT, typename StrFromT> inline static StrToT UtfConv(StrFromT const & s) {
    return UtfFrom32<StrToT>(UtfTo32(s));
}

#define Test(cs) \
    std::cout << Utf32To8(Utf8To32(std::string(cs))) << ", "; \
    std::cout << Utf32To8(Utf16To32(Utf32To16(Utf8To32(std::string(cs))))) << ", "; \
    std::cout << Utf32To8(Utf16To32(std::u16string(u##cs))) << ", "; \
    std::cout << Utf32To8(std::u32string(U##cs)) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::u16string>(UtfConv<std::u32string>(UtfConv<std::u32string>(UtfConv<std::u16string>(std::string(cs)))))) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::wstring>(UtfConv<std::string>(UtfConv<std::u32string>(UtfConv<std::u32string>(std::string(cs)))))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::string(cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u16string(u##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::wstring(L##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u32string(U##cs))) << std::endl; \
    std::cout << "UTF-8 num bytes: " << std::dec << Utf32To8(std::u32string(U##cs)).size() << ", "; \
    std::cout << "UTF-16 num bytes: " << std::dec << (Utf32To16(std::u32string(U##cs)).size() * 2) << std::endl;

int main() {
    #ifdef _WIN32
        SetConsoleOutputCP(65001);
    #else
        std::setlocale(LC_ALL, "en_US.UTF-8");
    #endif
    try {
        Test("World");
        Test("Привет");
        Test("");
        Test("");
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

World, World, World, World, World, World, World, World, World, World
UTF-8 num bytes: 5, UTF-16 num bytes: 10
Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет
UTF-8 num bytes: 12, UTF-16 num bytes: 12
, , , , , , , , , 
UTF-8 num bytes: 8, UTF-16 num bytes: 8
, , , , , , , , , 
UTF-8 num bytes: 4, UTF-16 num bytes: 4
Arty
2

If you are using C++11, you can check out std::codecvt_utf8_utf16:

http://www.cplusplus.com/reference/codecvt/codecvt_utf8_utf16/
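
For example, a minimal sketch for the question's Windows case (BstrToUtf8 is a hypothetical helper name; std::codecvt_utf8_utf16 is deprecated since C++17 but still available):

#include <windows.h>  // BSTR, SysStringLen
#include <codecvt>    // std::codecvt_utf8_utf16
#include <locale>     // std::wstring_convert
#include <string>

std::string BstrToUtf8(BSTR bstr)
{
    // On Windows wchar_t is a 16-bit UTF-16 code unit, so the facet applies directly.
    std::wstring wide(bstr, SysStringLen(bstr));
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conv;
    return conv.to_bytes(wide);
}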

beardedN5rd
  • Could you show me an example? I do not understand how to work with it. The BSTR input parameter is in UTF-16LE – user3252635 Jan 31 '14 at 09:44
  • I haven't the time to create one, but I found [link](https://stackoverflow.com/questions/7232710/convert-between-string-u16string-u32string); it covers that very explicitly. I hope that helps – beardedN5rd Jan 31 '14 at 11:39
  • @user3252635 there is an example in the linked documentation. There are [better examples at cppreference.com](https://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16). Also, look at [`std::wstring_convert`](https://en.cppreference.com/w/cpp/locale/wstring_convert). – Remy Lebeau Aug 31 '18 at 14:56
0
void encode_unicode_character(char* buffer, int* offset, wchar_t ucs_character)
{
    if (ucs_character <= 0x7F)
    {
        // Plain single-byte ASCII.
        buffer[(*offset)++] = (char) ucs_character;
    }
    else if (ucs_character <= 0x7FF)
    {
        // Two bytes.
        buffer[(*offset)++] = 0xC0 | (ucs_character >> 6);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 0) & 0x3F);
    }
    else if (ucs_character <= 0xFFFF)
    {
        // Three bytes.
        buffer[(*offset)++] = 0xE0 | (ucs_character >> 12);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 6) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 0) & 0x3F);
    }
    else if (ucs_character <= 0x1FFFFF)
    {
        // Four bytes.
        buffer[(*offset)++] = 0xF0 | (ucs_character >> 18);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 12) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 6) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 0) & 0x3F);
    }
    else if (ucs_character <= 0x3FFFFFF)
    {
        // Five bytes.
        buffer[(*offset)++] = 0xF8 | (ucs_character >> 24);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 18) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 12) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 6) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 0) & 0x3F);
    }
    else if (ucs_character <= 0x7FFFFFFF)
    {
        // Six bytes.
        buffer[(*offset)++] = 0xFC | (ucs_character >> 30);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 24) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 18) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 12) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 6) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 0) & 0x3F);
    }
    else
    {
        // Invalid char; don't encode anything.
    }
}

ISO 10646:2012 is all you need to understand UCS.
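
As the comment discussion below concludes, this function takes a single code point, so UTF-16 input must first have its surrogate pairs decoded. A minimal sketch of that missing step (assuming well-formed, null-terminated UTF-16 input; error handling omitted):

void encode_utf16_string(const char16_t* input, char* buffer, int* offset)
{
    while (*input)
    {
        char32_t c = *input++;
        // Combine a surrogate pair into a single code point.
        if (c >= 0xD800 && c <= 0xDBFF)
            c = 0x10000 + ((c - 0xD800) << 10) + (*input++ - 0xDC00);
        // Note: where wchar_t is only 16 bits, encode_unicode_character would
        // need a char32_t parameter instead to avoid truncating c here.
        encode_unicode_character(buffer, offset, (wchar_t)c);
    }
}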

kvv
  • UCS is not in the question. And UCS is not UTF-16. Is your code valid for UTF-16? – rubenvb Jan 30 '14 at 13:40
  • I did not say that UTF-16 is UCS, but rather that it's part of it, like UTF-8 is. – kvv Jan 30 '14 at 16:26
  • @kw I think you should answer the question that was asked rather than the question that you happen to have some code for. Take another read of the question title and observe that the question concerns conversion from UTF-16 to UTF-8. – David Heffernan Jan 31 '14 at 09:21
  • @DavidHeffernan Well, the encode_unicode_character function converts a UTF-16 encoded character to a UTF-8 encoded array of characters. Did I miss something? – kvv Jan 31 '14 at 09:56
  • @DavidHeffernan Do you think my code is not part of this discussion? – kvv Jan 31 '14 at 09:57
  • We don't understand why you mention UCS when the question is about UTF-16. We also fail to understand why you present code that converts UTF-32 to UTF-8 when the question is about UTF-16. It is also a mistake to assume that `wchar_t` can be used to hold a UTF-32 character element. On some systems it can, but not all. – David Heffernan Jan 31 '14 at 11:14
  • @DavidHeffernan I see you guys do not like the word UCS (or the abbreviation, whatever). I am really, really sorry, I should not have used this word, my bad :))) And yeah, you were right: this code works well even for UTF-32; by the way, I did not know that, thanks :) So this function should work both for UTF-16LE to UTF-8 conversion and UTF-32LE to UTF-8 conversion! I should even write the function prototype as void encode_unicode_character(char* buffer, int* offset, char32_t ucs_character). But if you do not want UTF-32LE you can comment out the proper lines, right? :))) – kvv Jan 31 '14 at 13:03
  • I did not say that the wchar_t type can be used to hold a UTF-32 character element. – kvv Jan 31 '14 at 13:10
  • To use this code to solve the problem in the Q, you need to first decode the UTF-16. – David Heffernan Jan 31 '14 at 13:13
  • @DavidHeffernan Really? Hmmm, I did not know that... How am I supposed to first decode the UTF-16? I mean, decode from what to what? So interesting... hmmm... – kvv Jan 31 '14 at 13:34
  • The code in your answer converts raw 32 bit Unicode code points to UTF-8. UTF-16 cannot encode all Unicode code points in a single 16 bit character. That much is obvious because there are more than 0xffff Unicode code points. So UTF-16 is a variable width encoding. Some characters are encoded in surrogate pairs. You can read about it here: http://en.wikipedia.org/wiki/UTF-16 – David Heffernan Jan 31 '14 at 13:38
  • @DavidHeffernan OK, I have read the wiki article... I put U+6C34 and U+007A into my function and it gave me the correct answer in UTF-8 encoding... Should I `decode` them before I give them to my function? – kvv Jan 31 '14 at 14:04
  • @kw You don't understand surrogate pairs yet. Take another read. – David Heffernan Jan 31 '14 at 14:06
  • @DavidHeffernan Now I do understand what `surrogate pairs` are. Let's say I have the code point U+1D11E; in UTF-16 encoding it becomes the surrogate pair 0xD834, 0xDD1E, right? – kvv Jan 31 '14 at 14:23
  • @DavidHeffernan I still do not understand your point... So you say that in order to convert char32_t c1 = 0x1D11E to UTF-8, I should first `decode` this Unicode point to a surrogate char16_t pair, then translate to UTF-8 encoding??? Right? – kvv Jan 31 '14 at 14:54
  • @kw No. The other way round. The question asks, and I feel like a stuck record, to convert **from UTF-16**. That means that the input is UTF-16. Your code converts from UTF-32. So to use it one would need to convert from UTF-16 to UTF-32, and then on to UTF-8. – David Heffernan Jan 31 '14 at 14:56
  • @DavidHeffernan Excuse me, but I think that is nonsense. – kvv Jan 31 '14 at 15:01
  • @DavidHeffernan OK, I now understand; I was wrong. If I have input in UTF-16 and there are surrogate pairs, the function won't work correctly. Thank you, I'll correct this function. – kvv Jan 31 '14 at 15:16