
We have a C++ application deployed on RHEL using ICU.

We have a situation wherein we need to convert UChar* to wchar_t* on Linux. We use u_strToWCS to perform the conversion.

#include <iostream>
#include <wchar.h>

#include "unicode/ustring.h"

void convertUnicodeStringtoWideChar(const UChar* cuniszSource,
                                    const int32_t cunii32SourceLength,
                                    wchar_t*& rpwcharDestination,
                                    int32_t& destCapacity)
{
  UErrorCode uniUErrorCode = U_ZERO_ERROR;

  int32_t pDestLength = 0;

  rpwcharDestination     = 0;
  destCapacity = 0;

  u_strToWCS(rpwcharDestination,
             destCapacity,
             &pDestLength,
             cuniszSource,
             cunii32SourceLength,
             &uniUErrorCode);

  uniUErrorCode = U_ZERO_ERROR;
  rpwcharDestination = new wchar_t[pDestLength+1];
  if(rpwcharDestination)
  {
    destCapacity = pDestLength+1;

    u_strToWCS(rpwcharDestination,
               destCapacity,
               &pDestLength,
               cuniszSource,
               cunii32SourceLength,
               &uniUErrorCode);

    destCapacity = wcslen(rpwcharDestination);
  }
} //function ends

int main()
{
    //                     a       ä       Š       €    (          )
    UChar input[20] = { 0x0061, 0x00e4, 0x0160, 0x20ac, 0xd87e, 0xdd29, 0x0000 };
    wchar_t * output;
    int32_t outlen = 0;
    convertUnicodeStringtoWideChar( input, 6, output, outlen );
    for ( int i = 0; i < outlen; ++i )
    {
        std::cout << std::hex << output[i] << "\n";
    }
    return 0;
}

This works fine for characters up to U+FFFF (UChar is implemented as uint16_t internally on Linux), but it fails to convert characters outside the Basic Multilingual Plane (e.g. CJK Unified Ideographs Extension B).

Any ideas on how to perform the conversion?

Update 1: OK, I was looking in the wrong direction. u_strToWCS works fine. The problem arises because I need to pass that wide string to a Java application on Windows using CORBA. Since wchar_t on Linux is 32-bit, I need to find a way to convert the 32-bit wchar_t to a 16-bit wchar_t.

Update 2: The code I have used can be found here

    "It fails..." -- Input, observed output, expected output? A `main()` giving those infos, and wrapping this function (plus necessary `#include` statements) into a compilable example? – DevSolar Mar 20 '17 at 09:06
  • Does encoding the characters as surrogate pairs with, for example, `u"\U0001F600"` or multi-byte strings with `u8"\U0001F600"`, work? – Davislor Mar 20 '17 at 09:34
  • I took the liberty of slapping on some crude `main()` to make it MCVE. Output is `61`, `e4`, `160`, `20ac`, `2f929` -- which is what I would expect. (Note the last unit, which is a non-BMP / UTF-16 surrogate pair in the input.) Repeating the question to the OP, in what way does it "fail"? – DevSolar Mar 20 '17 at 09:44
  • The ICU FAQ claims that the library supports UTF-16, not UCS-2, and therefore surrogate pairs should work. Does that answer the question of how to encode characters outside the BMP? – Davislor Mar 20 '17 at 09:51
  • As UCS-2 did not *know* anything beyond the BMP, it's kind of hard to imagine how the OP could have encoded any non-BMP characters in it. ;-) – DevSolar Mar 20 '17 at 09:54
  • @DevSolar Like you, I’m going to wait for clarification of what the inputs and outputs are. As you say, the `main()` function you wrote yourself appears to work. – Davislor Mar 20 '17 at 17:27
  • OK, I was looking in the wrong direction. u_strToWCS works fine. The problem arises because I need to pass that wide string to a Java application on Windows using CORBA. Since wchar_t on Linux is 32-bit, I need to find a way to convert 32-bit wchar_t to 16-bit wchar_t – D3XT3R Mar 21 '17 at 09:28
  • @D3XT3R: You're mixing up container datatype and encoding here, something that makes for less-than-precise communication and, potentially, misunderstanding. There are various **encodings** -- UTF-8, UTF-16 and UTF-32 being the "useful" ones -- and various **datatypes** that can be used for storing them -- `std::string` or `char[]` for UTF-8; `UChar[]`, `icu::UnicodeString`, `std::u16string` and (Windows) `wchar_t` for UTF-16; `UChar32[]`, `std::u32string` and (Unix) `wchar_t` for UTF-32. Each combination has its own pros and cons, unfortunately. – DevSolar Mar 21 '17 at 10:01
  • (ctd.) The UCS-2 that Davislor mentioned is what *became* UTF-16, before there was anything beyond the BMP. It's, basically, UTF-16 without the surrogate pairs. (Warning, oversimplification.) It's what Microsoft was looking at when opting for a 16-bit definition of `wchar_t`. Note I didn't list `std::wstring` above. 16 vs. 32 bit is evil enough when handled in terms of `wchar_t`; once you wrap it into `std::wstring` it becomes toxic. Stick to `std::u16string` / `std::u32string` if you want the standard library, and to the ICU side of things if you want the whole bells & whistles of Unicode. – DevSolar Mar 21 '17 at 10:03
  • For input of 乕乭乺丕, I received `4e55 4e6d 20044 20049 4e7a 4e15` on Linux and `4e55 4e6d d840 dc44 d840 dc49 4e7a 4e15` on Windows. – D3XT3R Mar 21 '17 at 10:08
  • @DevSolar https://groups.google.com/forum/#!topic/comp.soft-sys.ace/yl71mMf0lDA this is my situation – D3XT3R Mar 21 '17 at 10:12
  • And what would have been your *expected* output? As I said, on Windows `wchar_t` is 16bit, so your input is "converted" to the appropriate encoding (UTF-16, which admittedly is not a "wide" encoding in the literal meaning of the term). On Linux (32bit `wchar_t`) you get UTF-32 encoding from this function. – DevSolar Mar 21 '17 at 10:20
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/138619/discussion-between-devsolar-and-d3xt3r). – DevSolar Mar 21 '17 at 10:23
  • Thanks for clearing that up. I gave you some sample code that does it using `codecvt_utf16` from the standard library. I could also write some that uses the C++17 facets instead of C++11, if you want to avoid deprecated features. However, to go in the other direction using ICU, you can use either `u_strFromWCS()`, `u_strFromUTF32()`, or `UnicodeString::fromUTF32()`. Note that all of these are just as global as `codecvt_utf16`, not specific to any locale, if that is your reason to prefer ICU. You might also have to convert to UTF-16BE/LE, but this is a flag in the STL. – Davislor Mar 21 '17 at 14:33

2 Answers


In C++11 and later, this conversion is in the standard library, in the <codecvt> header. Here is some sample code that converts between UTF-16, UCS-4 and wchar_t. (It breaks on libstdc++ 6.4.9 due to a bug that has been fixed in the development tree.)

#include <codecvt>
#include <cstdlib>
#include <cstring>
#include <cwctype>
#include <iostream>
#include <locale>
#include <vector>

using std::cout;
using std::endl;
using std::exit;
using std::memcmp;
using std::size_t;

using std::wcout;

int main(void)
{
  constexpr char16_t msg_utf16[] = u"¡Hola, mundo! \U0001F600"; // Shouldn't assume endianness.
  constexpr wchar_t msg_w[] = L"¡Hola, mundo! \U0001F600";
  constexpr char32_t msg_utf32[] = U"¡Hola, mundo! \U0001F600";
  constexpr char msg_utf8[] = u8"¡Hola, mundo! \U0001F600";

  // May vary from OS to OS.  "" is the most standard, but might require, e.g., "en_US.utf8".
  constexpr char locale_name[] = "";
  std::locale::global(std::locale(locale_name));
  wcout.imbue(std::locale());

  const std::codecvt_utf16<wchar_t, 0x1FFFF, std::little_endian> converter_w;
  const size_t max_len = sizeof(msg_utf16);
  std::vector<char> out(max_len);
  std::mbstate_t state{}; // The conversion state must be zero-initialized before use.
  const wchar_t* from_w = nullptr;
  char* to_next = nullptr;

  converter_w.out( state, msg_w, msg_w+sizeof(msg_w)/sizeof(wchar_t), from_w, out.data(), out.data() + out.size(), to_next );

  
  if (memcmp( msg_utf8, out.data(), sizeof(msg_utf8) ) == 0 ) {
    wcout << L"std::codecvt_utf16<wchar_t> converts to UTF-8, not UTF-16!" << endl;
  } else if ( memcmp( msg_utf16, out.data(), max_len ) != 0 ) {
    wcout << L"std::codecvt_utf16<wchar_t> conversion not equal!" << endl;
  } else {
    wcout << L"std::codecvt_utf16<wchar_t> conversion is correct." << endl;
  }
  out.clear();
  out.resize(max_len);

  const std::codecvt_utf16<char32_t, 0x1FFFF, std::little_endian> converter_u32;
  const char32_t* from_u32 = nullptr;
  converter_u32.out( state, msg_utf32, msg_utf32+sizeof(msg_utf32)/sizeof(char32_t), from_u32, out.data(), out.data() + out.size(), to_next );

  if ( memcmp( msg_utf16, out.data(), max_len ) != 0 ) {
    wcout << L"std::codecvt_utf16<char32_t> conversion not equal!" << endl;
  } else {
    wcout << L"std::codecvt_utf16<char32_t> conversion is correct." << endl;
  }

  wcout << msg_w << endl;
  return EXIT_SUCCESS;
}

Those two facets are deprecated in C++17, along with the rest of the <codecvt> header, but the std::codecvt specializations declared in <locale> are not. In particular, the standard library will continue to support std::codecvt<char, char, std::mbstate_t>, std::codecvt<char16_t, char, std::mbstate_t>, std::codecvt<char32_t, char, std::mbstate_t> and std::codecvt<wchar_t, char, std::mbstate_t>.
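
For example, here is a minimal sketch (not tied to any particular implementation) of requesting the non-deprecated UTF-16/UTF-8 facet from the global locale and converting one non-BMP character:

#include <cwchar>
#include <locale>

int main()
{
  // std::codecvt<char16_t, char, std::mbstate_t> is one of the required,
  // non-deprecated specializations; it converts between UTF-16 and UTF-8.
  const auto& cvt =
      std::use_facet<std::codecvt<char16_t, char, std::mbstate_t>>(std::locale());

  const char16_t src[] = u"\U0001F600";   // One non-BMP code point (a surrogate pair).
  char dst[16] = {};
  std::mbstate_t state{};
  const char16_t* from_next = nullptr;
  char* to_next = nullptr;

  // out() goes from the internal type (char16_t) to the external type (char).
  cvt.out( state, src, src + 2, from_next, dst, dst + sizeof dst, to_next );
  // dst now holds the UTF-8 encoding of U+1F600.
}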

You don’t say where this UTF-16 data on Linux comes from, but that might suggest an approach. If it’s to work with files, you can imbue() a stream with a facet that converts the data as it is read and written, and if it’s to work with the Qt framework, both QString and QTextCodec provide conversion functions. Still, ICU should support the entire range of UTF-16.
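
As a sketch of the stream approach (the file name data.txt and the UTF-16LE byte order are only assumptions for the example):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
  // Read a UTF-16LE file into wide strings; the imbued facet converts each
  // line to the platform's wchar_t encoding as it is read.
  std::wifstream in( "data.txt", std::ios::binary );
  in.imbue( std::locale( in.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10FFFF, std::little_endian> ) );

  std::wstring line;
  while ( std::getline( in, line ) ) {
    // Process each line here.
  }
}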

Update 1

The question really was asking how to convert in the opposite direction, from wide strings to UTF-16. My example does that, but if you want to use ICU, it has u_strFromWCS(), u_strFromUTF32() and UnicodeString::fromUTF32().
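
For example, a minimal sketch using u_strFromWCS() with the usual ICU pre-flighting pattern (the helper name wideToUChar is my own, not part of ICU):

#include <vector>
#include "unicode/ustring.h"

std::vector<UChar> wideToUChar( const wchar_t* src, int32_t srcLength )
{
  UErrorCode status = U_ZERO_ERROR;
  int32_t destLength = 0;

  // The first call only measures the required length; it sets
  // U_BUFFER_OVERFLOW_ERROR, which we reset before converting.
  u_strFromWCS( nullptr, 0, &destLength, src, srcLength, &status );

  std::vector<UChar> dest( destLength + 1 );
  status = U_ZERO_ERROR;
  u_strFromWCS( dest.data(), static_cast<int32_t>( dest.size() ), &destLength,
                src, srcLength, &status );
  return dest;
}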

If your reason to prefer ICU to the STL is that the STL’s converter facets claim to be locale-independent, observe that those ICU converter functions all claim to be locale-independent, too. This is because conversion between different UTF encodings is completely algorithmic and independent of locale! (Other things, like sorting order and case mappings, are not, but that is.) In fact, the STL does allow you to request a converter facet from a specific locale with std::use_facet<std::codecvt<...>>(locale) if you want to, and this is not deprecated in C++17. Only the conversions to and from UTF-8 are required to be implemented this way, however. “In addition, every locale object constructed in a C++ program implements its own (locale-specific) versions of these four specializations.” In my tests, existing implementations of the library do not support std::use_facet<std::codecvt<wchar_t, char16_t, std::mbstate_t>>(std::locale()).

Update 2

I’m reposting a manual wchar_t to UTF-16 converter from my answer here. It takes a std::wstring and returns a std::u16string, but the algorithm could easily be adapted to any other containers. A u16string will be at least as efficient as any other data structure that requires dynamic memory, though.

One change you might want to make is that I allocate enough memory for the worst possible case, given the length of the input string, then shrink_to_fit() afterward. This should waste no more memory than encoding your string as UTF-32 did in the first place. However, it’s extremely unlikely that none of your data will be in the BMP, so you could instead make an initial pass to count how much memory the conversion will need (sketched after the program below), or assume there will be very few surrogate pairs in real-world use and accept the unlikely possibility of having to resize and copy the destination array.

#include <cassert>
#include <cwctype>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>

#if _WIN32 || _WIN64
// Windows needs a little non-standard magic for this to work.
#include <io.h>
#include <fcntl.h>
#include <locale.h>
#endif

using std::size_t;

void init_locale(void)
// Does magic so that wcout can work.
{
#if _WIN32 || _WIN64
  // Windows needs a little non-standard magic.
  constexpr char cp_utf16le[] = ".1200";
  setlocale( LC_ALL, cp_utf16le );
  _setmode( _fileno(stdout), _O_U16TEXT );
#else
  // The correct locale name may vary by OS, e.g., "en_US.utf8".
  constexpr char locale_name[] = "";
  std::locale::global(std::locale(locale_name));
  std::wcout.imbue(std::locale());
#endif
}

std::u16string make_u16string( const std::wstring& ws )
/* Creates a UTF-16 string from a wide-character string.  Any wide characters
 * outside the allowed range of UTF-16 are mapped to the sentinel value U+FFFD,
 * per the Unicode documentation. (http://www.unicode.org/faq/private_use.html
 * retrieved 12 March 2017.) Unpaired surrogates in ws are also converted to
 * sentinel values.  Noncharacters, however, are left intact.  As a fallback,
 * if wide characters are the same size as char16_t, this does a more trivial
 * construction using that implicit conversion.
 */
{
  /* We assume that, if this test passes, a wide-character string is already
   * UTF-16, or at least converts to it implicitly without needing surrogate
   * pairs.
   */
  if ( sizeof(wchar_t) == sizeof(char16_t) ) {
    return std::u16string( ws.begin(), ws.end() );
  } else {
    /* The conversion from UTF-32 to UTF-16 might possibly require surrogates.
     * A surrogate pair suffices to represent all wide characters, because all
     * characters outside the range will be mapped to the sentinel value
     * U+FFFD.  Add one character for the terminating NUL.
     */
    const size_t max_len = 2 * ws.length() + 1;
    // Our temporary UTF-16 string.
    std::u16string result;

    result.reserve(max_len);

    for ( const wchar_t& wc : ws ) {
      const std::wint_t chr = wc;

      if ( chr < 0 || chr > 0x10FFFF || (chr >= 0xD800 && chr <= 0xDFFF) ) {
        // Invalid code point.  Replace with sentinel, per Unicode standard:
        constexpr char16_t sentinel = u'\uFFFD';
        result.push_back(sentinel);
      } else if ( chr < 0x10000UL ) { // In the BMP.
        result.push_back(static_cast<char16_t>(wc));
      } else {
        const char16_t leading = static_cast<char16_t>( 
          ((chr-0x10000UL) / 0x400U) + 0xD800U );
        const char16_t trailing = static_cast<char16_t>( 
          ((chr-0x10000UL) % 0x400U) + 0xDC00U );

        result.append({leading, trailing});
      } // end if
    } // end for

   /* The returned string is shrunken to fit, which might not be the Right
    * Thing if there is more to be added to the string.
    */
    result.shrink_to_fit();

    // We depend here on the compiler to optimize the move constructor.
    return result;
  } // end if
  // Not reached.
}

int main(void)
{
  static const std::wstring wtest(L"☪☮∈✡℩☯✝ \U0001F644");
  static const std::u16string u16test(u"☪☮∈✡℩☯✝ \U0001F644");
  const std::u16string converted = make_u16string(wtest);

  init_locale();

  std::wcout << L"sizeof(wchar_t) == " << sizeof(wchar_t) << L".\n";

  for( size_t i = 0; i <= u16test.length(); ++i ) {
    if ( u16test[i] != converted[i] ) {
      std::wcout << std::hex << std::showbase
                 << std::right << std::setfill(L'0')
                 << std::setw(4) << (unsigned)converted[i] << L" ≠ "
                 << std::setw(4) << (unsigned)u16test[i] << L" at "
                 << i << L'.' << std::endl;
      return EXIT_FAILURE;
    } // end if
  } // end for

  std::wcout << wtest << std::endl;

  return EXIT_SUCCESS;
}
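
The counting pass mentioned above might look something like this (a sketch only; count_utf16_units is a hypothetical helper, not part of the program above):

#include <cstddef>
#include <cstdint>
#include <string>

std::size_t count_utf16_units( const std::wstring& ws )
{
  std::size_t units = 0;
  for ( const wchar_t wc : ws ) {
    const std::uint32_t cp = static_cast<std::uint32_t>( wc );
    // Code points above the BMP (and within the Unicode range) need a
    // surrogate pair; everything else, including anything the converter would
    // replace with U+FFFD, needs a single unit.
    units += ( cp >= 0x10000U && cp <= 0x10FFFFU ) ? 2 : 1;
  }
  return units;   // Pass this to result.reserve() instead of max_len.
}
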
  • 1) Should be a comment, not an answer (as OP specified ICU -- C++11 might be unavailable, or ICU be set as a requirement, we don't know). 2) The conversions in `<codecvt>` appeared in C++11. 3) The conversions in `<codecvt>` are [*deprecated*](http://en.cppreference.com/w/cpp/header/codecvt) by C++17. ;-) – DevSolar Mar 20 '17 at 09:19
  • Corrected the error about when the facets were added. Thank you. If the Linux C++ build environment supports libc++ or any recent version of libstdc++, it supports those facets. In any case, that’s the sample code I happen to have on hand, and it is too late at night for me to write a demo using ICU right now. I can come back to this tomorrow. – Davislor Mar 20 '17 at 09:30
  • @DevSolar: OT, but do you know why they have been deprecated? – MikeMB Mar 21 '17 at 09:46
  • @DevSolar Updated my answer. – Davislor Mar 21 '17 at 14:51
  • @Davislor if you can share manual wchar_t* to UTF-16 conversion written by you, it would be helpful – D3XT3R Mar 22 '17 at 10:10

The following is the code to convert UTF-32 encoded wide characters to UTF-16:

#include <cstdint>

// Function to convert a Unicode string from platform-specific "wide characters"
// (wchar_t, UTF-32 on Linux) to UTF-16.  The caller must allocate 'destination'
// with room for at least 2 * sourceLength + 1 wchar_t units.
void ConvertUTF32ToUTF16(wchar_t* source,
                         const uint32_t sourceLength,
                         wchar_t*& destination,
                         uint32_t& destinationLength)
{

  wchar_t wcharCharacter;
  uint32_t uniui32Counter = 0;

  wchar_t* pwszDestinationStart = destination;
  wchar_t* sourceStart = source;

  destinationLength = 0;   // Report only what this call writes.

  if(0 != destination)
  {
    while(uniui32Counter < sourceLength)
    {
      wcharCharacter = *source++;
      if(wcharCharacter <= 0x0000FFFF)
      {
        /* UTF-16 surrogate values are illegal in UTF-32
           0xFFFF or 0xFFFE are both reserved values */
        if(wcharCharacter >= 0xD800 && 
           wcharCharacter <= 0xDFFF)
        {
          *destination++ = 0x0000FFFD;
          destinationLength += 1;
        }
        else
        {
          /* source is a BMP Character */
          destinationLength += 1;
          *destination++ = wcharCharacter;
        }
      }
      else if(wcharCharacter > 0x0010FFFF)
      {
        /* U+10FFFF is the largest code point of Unicode Character Set */
        *destination++ = 0x0000FFFD;
        destinationLength += 1;
      }
      else
      {
        /* source is a character in range 0xFFFF - 0x10FFFF */
        wcharCharacter -= 0x0010000UL;
        *destination++ = (wchar_t)((wcharCharacter >> 10) + 0xD800);
        *destination++ = (wchar_t)((wcharCharacter & 0x3FFUL) + 0xDC00);
        destinationLength += 2;
      }

      ++uniui32Counter;
    }

    destination = pwszDestinationStart;
    destination[destinationLength] = '\0';
  }

  source = sourceStart;
} //function ends
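
A possible way to call it on Linux (main here is only an illustration; the buffer of 2 * sourceLength + 1 covers the worst case, where every character needs a surrogate pair):

#include <iostream>

int main()
{
  // One BMP character and one character from CJK Unified Ideographs Extension B.
  wchar_t source[] = { 0x4E55, 0x20044, 0 };
  const uint32_t sourceLength = 2;

  uint32_t destinationLength = 0;
  wchar_t* destination = new wchar_t[2 * sourceLength + 1];

  ConvertUTF32ToUTF16( source, sourceLength, destination, destinationLength );

  // Expected output: 4e55 d840 dc44 (the second character becomes a surrogate pair).
  for ( uint32_t i = 0; i < destinationLength; ++i )
    std::cout << std::hex << static_cast<unsigned long>( destination[i] ) << "\n";

  delete[] destination;
  return 0;
}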