In C++11 and later, this conversion is in the standard library, in the <codecvt> header. Here is some sample code that converts between UTF-16, UCS-4 and wchar_t. (It breaks on libstdc++ 6.4.9 due to a bug that has been fixed in the development tree.)
#include <codecvt>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <cwctype>
#include <iostream>
#include <locale>
#include <vector>

using std::cout;
using std::endl;
using std::exit;
using std::memcmp;
using std::size_t;
using std::wcout;

int main(void)
{
    constexpr char16_t msg_utf16[] = u"¡Hola, mundo! \U0001F600"; // Shouldn't assume endianness.
    constexpr wchar_t msg_w[] = L"¡Hola, mundo! \U0001F600";
    constexpr char32_t msg_utf32[] = U"¡Hola, mundo! \U0001F600";
    // Valid through C++17; in C++20, u8 literals have type char8_t.
    constexpr char msg_utf8[] = u8"¡Hola, mundo! \U0001F600";

    // May vary from OS to OS: "" is the most standard, but might require, e.g. "en_US.utf8".
    constexpr char locale_name[] = "";
    std::locale::global(std::locale(locale_name));
    wcout.imbue(std::locale());

    // 0x10FFFF is the largest Unicode code point.
    const std::codecvt_utf16<wchar_t, 0x10FFFF, std::little_endian> converter_w;
    const size_t max_len = sizeof(msg_utf16);
    std::vector<char> out(max_len);
    // Value-initialized, so it represents the initial conversion state.
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* from_w = nullptr;
    char* to_next = nullptr;

    converter_w.out( state, msg_w, msg_w+sizeof(msg_w)/sizeof(wchar_t), from_w, out.data(), out.data() + out.size(), to_next );

    if (memcmp( msg_utf8, out.data(), sizeof(msg_utf8) ) == 0 ) {
        wcout << L"std::codecvt_utf16<wchar_t> converts to UTF-8, not UTF-16!" << endl;
    } else if ( memcmp( msg_utf16, out.data(), max_len ) != 0 ) {
        wcout << L"std::codecvt_utf16<wchar_t> conversion not equal!" << endl;
    } else {
        wcout << L"std::codecvt_utf16<wchar_t> conversion is correct." << endl;
    }

    out.clear();
    out.resize(max_len);

    const std::codecvt_utf16<char32_t, 0x10FFFF, std::little_endian> converter_u32;
    const char32_t* from_u32 = nullptr;

    converter_u32.out( state, msg_utf32, msg_utf32+sizeof(msg_utf32)/sizeof(char32_t), from_u32, out.data(), out.data() + out.size(), to_next );

    if ( memcmp( msg_utf16, out.data(), max_len ) != 0 ) {
        wcout << L"std::codecvt_utf16<char32_t> conversion not equal!" << endl;
    } else {
        wcout << L"std::codecvt_utf16<char32_t> conversion is correct." << endl;
    }

    wcout << msg_w << endl;
    return EXIT_SUCCESS;
}
Those two facets are deprecated as of C++17, along with the rest of the <codecvt> header, but the std::codecvt facets in <locale> are not. In particular, the standard library still supports std::codecvt<char, char, std::mbstate_t>, std::codecvt<char16_t, char, std::mbstate_t>, std::codecvt<char32_t, char, std::mbstate_t> and std::codecvt<wchar_t, char, std::mbstate_t>. (C++20 later deprecated the char16_t and char32_t specializations in favor of char8_t versions.)
You don’t go into the source of this UTF-16 data on Linux, but that might suggest an approach. If it’s to work with files, you can use imbue() on a stream with a facet to convert the data as it is read and written, and if it’s to work with the Qt framework, both QString and QTextCodec provide conversion functions. Still, ICU should support the entire range of UTF-16.
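For the file case, a minimal sketch of the imbue() approach might look like the following. It assumes the file is UTF-16LE with no byte-order mark and that wchar_t is 32 bits wide, as on Linux. The name read_utf16le_file is made up for illustration, and the facet it uses is one of those deprecated in C++17.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: reads a whole UTF-16LE file (no BOM assumed) into a
// wide string. Assumes a 32-bit wchar_t, so each code point fits in one.
std::wstring read_utf16le_file(const char* path)
{
    std::wifstream in;
    // Imbue before opening, so that no I/O happens under the old locale.
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF, std::little_endian>));
    // Binary mode keeps the stream from translating any bytes itself.
    in.open(path, std::ios::binary);
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

If the file cannot be opened, this sketch simply returns an empty string; real code would want to check the stream state.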
Update 1
The question really was asking how to convert in the opposite direction, from wide strings to UTF-16. My example does that, but if you want to use ICU, it has u_strFromWCS(), u_strFromUTF32() and UnicodeString::fromUTF32().
If your reason to prefer ICU to the STL is that the STL’s converter facets claim to be locale-independent, observe that those ICU converter functions all claim to be locale-independent, too. This is because conversion between the different UTF encodings is completely algorithmic and independent of locale! (Other things, like sorting order and case mappings, are not, but that is.) In fact, the STL does allow you to request a converter facet from a specific locale with std::use_facet<std::codecvt<...>>(loc) if you want to, and this is not deprecated in C++17. Only the conversions to and from UTF-8 are required to be implemented this way, however: “In addition, every locale object constructed in a C++ program implements its own (locale-specific) versions of these four specializations.” In my tests, existing implementations of the library do not support std::use_facet<std::codecvt<wchar_t, char16_t, std::mbstate_t>>(std::locale()).
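To make the use_facet route concrete, here is a small sketch that asks a locale for its std::codecvt<char32_t, char, std::mbstate_t> facet and uses it to turn UTF-32 into UTF-8. The helper name utf32_to_utf8 is mine, not part of any library, and the default argument is just the classic "C" locale. Note that C++20 deprecates this particular specialization.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-32 to UTF-8 with the codecvt facet that
// every locale object is required to provide.
std::string utf32_to_utf8(const std::u32string& in,
                          const std::locale& loc = std::locale::classic())
{
    using facet_t = std::codecvt<char32_t, char, std::mbstate_t>;
    if (in.empty()) {
        return std::string();
    }
    const facet_t& cvt = std::use_facet<facet_t>(loc);
    std::mbstate_t state = std::mbstate_t();  // Initial conversion state.
    // max_length() is the most bytes one code point can expand to.
    std::string out(in.size() * cvt.max_length(), '\0');
    const char32_t* from_next = nullptr;
    char* to_next = nullptr;
    const std::codecvt_base::result res =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);
    if (res != std::codecvt_base::ok) {
        throw std::range_error("utf32_to_utf8: conversion failed");
    }
    out.resize(static_cast<std::string::size_type>(to_next - &out[0]));
    return out;
}
```

For example, utf32_to_utf8(U"\U0001F600") yields the four bytes F0 9F 98 80, and passing a different locale object should not change that.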
Update 2
I’m reposting a manual wchar_t to UTF-16 converter from my answer here. It takes a std::wstring and returns a std::u16string, but the algorithm could easily be adapted to any other containers. A u16string will be at least as efficient as any other data structure that requires dynamic memory, though.
One change you might want to make is that I allocate enough memory for the worst possible case, given the length of the input string, then shrink_to_fit() afterward. This should waste no more memory than encoding your string as UTF-32 did in the first place. However, it’s extremely unlikely that none of your data will be in the BMP, so you could instead make an initial pass to count how much memory the conversion will need, or assume there will be very few surrogate pairs in real-world use and accept the unlikely possibility of having to resize and copy the destination array.
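Before the full listing, here is a rough sketch of what that counting pass could look like. It is not part of the converter; utf16_length is a made-up name, and it assumes wchar_t holds UTF-32 code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the char16_t units that converting ws to
// UTF-16 would need, assuming wchar_t holds UTF-32 code points.
std::size_t utf16_length(const std::wstring& ws)
{
    std::size_t units = 0;
    for (const wchar_t wc : ws) {
        const unsigned long cp = static_cast<unsigned long>(wc);
        // Supplementary-plane code points need a surrogate pair. Everything
        // else, including values that get replaced by U+FFFD, needs one unit.
        units += (cp >= 0x10000UL && cp <= 0x10FFFFUL) ? 2 : 1;
    }
    return units;
}
```

With this, the converter could call result.reserve(utf16_length(ws)) and skip the shrink_to_fit(), at the cost of walking the input twice.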
#include <cassert>
#include <cwctype>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>

#if _WIN32 || _WIN64
// Windows needs a little non-standard magic for this to work.
#include <io.h>
#include <fcntl.h>
#include <locale.h>
#include <stdio.h> // For _fileno() and stdout.
#endif

using std::size_t;

// Does magic so that wcout can work.
void init_locale(void)
{
#if _WIN32 || _WIN64
    // Windows needs a little non-standard magic.
    constexpr char cp_utf16le[] = ".1200";
    setlocale( LC_ALL, cp_utf16le );
    _setmode( _fileno(stdout), _O_U16TEXT );
#else
    // The correct locale name may vary by OS, e.g., "en_US.utf8".
    constexpr char locale_name[] = "";
    std::locale::global(std::locale(locale_name));
    std::wcout.imbue(std::locale());
#endif
}

std::u16string make_u16string( const std::wstring& ws )
/* Creates a UTF-16 string from a wide-character string.  Any wide characters
 * outside the allowed range of UTF-16 are mapped to the sentinel value U+FFFD,
 * per the Unicode documentation. (http://www.unicode.org/faq/private_use.html
 * retrieved 12 March 2017.) Unpaired surrogates in ws are also converted to
 * sentinel values.  Noncharacters, however, are left intact.  As a fallback,
 * if wide characters are the same size as char16_t, this does a more trivial
 * construction using that implicit conversion.
 */
{
    /* We assume that, if this test passes, a wide-character string is already
     * UTF-16, or at least converts to it implicitly without needing surrogate
     * pairs.
     */
    if ( sizeof(wchar_t) == sizeof(char16_t) ) {
        return std::u16string( ws.begin(), ws.end() );
    } else {
        /* The conversion from UTF-32 to UTF-16 might possibly require surrogates.
         * A surrogate pair suffices to represent all wide characters, because all
         * characters outside the range will be mapped to the sentinel value
         * U+FFFD.  Add one character for the terminating NUL.
         */
        const size_t max_len = 2 * ws.length() + 1;
        // Our temporary UTF-16 string.
        std::u16string result;

        result.reserve(max_len);

        for ( const wchar_t& wc : ws ) {
            const std::wint_t chr = wc;

            if ( chr < 0 || chr > 0x10FFFF || (chr >= 0xD800 && chr <= 0xDFFF) ) {
                // Invalid code point.  Replace with sentinel, per Unicode standard:
                constexpr char16_t sentinel = u'\uFFFD';
                result.push_back(sentinel);
            } else if ( chr < 0x10000UL ) { // In the BMP.
                result.push_back(static_cast<char16_t>(wc));
            } else {
                const char16_t leading = static_cast<char16_t>(
                    ((chr-0x10000UL) / 0x400U) + 0xD800U );
                const char16_t trailing = static_cast<char16_t>(
                    ((chr-0x10000UL) % 0x400U) + 0xDC00U );

                result.append({leading, trailing});
            } // end if
        } // end for

        /* The returned string is shrunken to fit, which might not be the Right
         * Thing if there is more to be added to the string.
         */
        result.shrink_to_fit();

        // We depend here on the compiler to optimize the move constructor.
        return result;
    } // end if

    // Not reached.
}

int main(void)
{
    static const std::wstring wtest(L"☪☮∈✡℩☯✝ \U0001F644");
    static const std::u16string u16test(u"☪☮∈✡℩☯✝ \U0001F644");
    const std::u16string converted = make_u16string(wtest);

    init_locale();

    std::wcout << L"sizeof(wchar_t) == " << sizeof(wchar_t) << L".\n";

    for( size_t i = 0; i <= u16test.length(); ++i ) {
        if ( u16test[i] != converted[i] ) {
            std::wcout << std::hex << std::showbase
                       << std::right << std::setfill(L'0')
                       << std::setw(4) << (unsigned)converted[i] << L" ≠ "
                       << std::setw(4) << (unsigned)u16test[i] << L" at "
                       << i << L'.' << std::endl;
            return EXIT_FAILURE;
        } // end if
    } // end for

    std::wcout << wtest << std::endl;

    return EXIT_SUCCESS;
}
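If you also need to go back the other way, for instance to round-trip a string and check the converter above, the decoding step is the same arithmetic in reverse. This is only a sketch under my own assumptions: make_u32string is a hypothetical name, and it returns a std::u32string rather than a std::wstring so that it does not depend on the width of wchar_t.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-16 into UTF-32 code points. Unpaired
// surrogates become U+FFFD, mirroring the sentinel used when encoding.
std::u32string make_u32string(const std::u16string& s)
{
    std::u32string result;
    result.reserve(s.length());  // Never more code points than code units.

    for (std::size_t i = 0; i < s.length(); ++i) {
        const char32_t unit = s[i];
        const bool is_lead = unit >= 0xD800 && unit <= 0xDBFF;
        const bool is_trail = unit >= 0xDC00 && unit <= 0xDFFF;

        if (is_lead && i + 1 < s.length()
                && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // Ten bits come from each half, plus the 0x10000 offset.
            const char32_t trail = s[i + 1];
            result.push_back(0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00));
            ++i;  // Skip the trailing surrogate we just consumed.
        } else if (is_lead || is_trail) {
            result.push_back(U'\uFFFD');  // Unpaired surrogate.
        } else {
            result.push_back(unit);  // Ordinary BMP code unit.
        }
    }
    return result;
}
```

For example, the pair D83D DE44 decodes to U+1F644, the last character of the test string above.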