6

I need C++ code to convert a string given as wchar_t* to a UTF-16 string. It must work on both Windows and Linux. I've looked through a lot of web pages, but the subject is still not clear to me.

As I understand it, I need to:

  1. Call setlocale with LC_CTYPE and a UTF-16 encoding.
  2. Use wcstombs to convert the wchar_t string to a UTF-16 string.
  3. Call setlocale again to restore the previous locale.

Do you know a way to convert wchar_t* to UTF-16 portably (Windows and Linux)?

Jonathan Leffler
Andrei Baskakov
  • Maybe my encoding-related questions [#1](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability), [#2](http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c0x), [#3](http://stackoverflow.com/questions/7562609/what-does-cuchar-provide-and-where-is-it-documented) are of some use. – Kerrek SB Mar 14 '12 at 06:56
  • 2
    Which code set is the `wchar_t` string in? What type do you expect to use to represent the character type in the UTF-16 string? Is this simply a transform between UTF-32 (in the `wchar_t`) and UTF-16 in `uint16_t`? Or are you dealing with codeset conversion too? Portability is a noble goal; it is not always achievable, sadly. Do investigate [ICU](http://icu-project.org/). – Jonathan Leffler Mar 14 '12 at 06:56

5 Answers

8

There is no single cross-platform method for doing this in C++03 (not without a library). This is in part because wchar_t is itself not the same thing across platforms. Under Windows, wchar_t is a 16-bit value, while on other platforms it is often a 32-bit value. So you would need two different codepaths to do it.
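To make the "two codepaths" point concrete, here is a minimal sketch that checks the width of wchar_t at compile time. It assumes a C++11 compiler; the names `kWcharBits` and `wchar_conversion_path` are made up for this illustration and do not come from any library.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Width of wchar_t in bits on the platform being compiled for.
constexpr std::size_t kWcharBits = sizeof(wchar_t) * CHAR_BIT;

// This sketch only knows about the two common cases (16-bit and 32-bit).
static_assert(kWcharBits == 16 || kWcharBits == 32,
              "unexpected wchar_t width");

// Hypothetical helper: names the codepath a converter would have to take.
inline std::string wchar_conversion_path() {
    return kWcharBits == 16 ? "copy 16-bit units as-is"
                            : "encode each code point as UTF-16";
}
```

On Windows this would typically select the copy path, and on Linux the encoding path.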

Nicol Bolas
5

C++11's std::codecvt_utf16 should work, I think.

std::codecvt_utf16 is a std::codecvt facet which encapsulates conversion between a UTF-16 encoded byte string and UCS2 or UCS4 character string (depending on the type of Elem).

See this: http://en.cppreference.com/w/cpp/locale/codecvt_utf16
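For illustration, one way this facet might be used is through std::wstring_convert. This is only a sketch: it assumes a standard library that ships the `<codecvt>` header and a platform where wchar_t holds UCS-4 code points (as on Linux). The function name `wide_to_utf16_bytes` is made up for the example. Note that both std::wstring_convert and std::codecvt_utf16 are deprecated since C++17.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Sketch: wide string -> UTF-16 encoded byte string.
// With the default template arguments the output is big-endian and
// carries no byte order mark. to_bytes() throws std::range_error if
// the input cannot be converted.
inline std::string wide_to_utf16_bytes(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf16<wchar_t>> converter;
    return converter.to_bytes(wide);
}
```

The result is a byte string, not a sequence of 16-bit units, so the byte order matters to whoever consumes it. Passing std::little_endian as the facet's third template argument should select little-endian output instead.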

Pubby
  • All well and good, except G++ (or, more precisely, libstdc++) doesn't provide the `<codecvt>` header yet, so `std::codecvt_utf16` is not available. – Tom Sep 09 '14 at 05:53
  • 1
    C++11 also introduces `char16_t` and `char32_t` types (and associated `std::basic_string` typedefs) to get away from `wchar_t` platform issues. For instance, use `std::u16string` wherever you need a UTF-16 encoded string. – Remy Lebeau Oct 13 '15 at 01:25
3

You can assume that wchar_t is UTF-32 in the non-Windows world. That is true on Linux, Mac OS X, and most *nix systems (there are very few exceptions, and they are on systems you will probably never touch :-) ).

And wchar_t is UTF-16 on Windows. So on Windows the conversion function can just do a memcpy :-)

On everything else, the conversion is algorithmic and pretty simple, so there is no need for fancy support from 3rd-party libraries.

Here is the basic algorithm: http://unicode.org/faq/utf_bom.html#utf16-3

And you can probably find a dozen different implementations if you don't want to write your own :-)
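A minimal sketch of that approach, assuming C++11 (for char16_t and std::u16string) and assuming wchar_t is UTF-16 when it is 2 bytes wide and UTF-32 otherwise. The function name `wchar_to_utf16` and the choice to replace invalid input with U+FFFD are my own for this illustration, not part of the FAQ.

```cpp
#include <cstdint>
#include <string>

// Sketch: convert a null-terminated wchar_t string to UTF-16.
inline std::u16string wchar_to_utf16(const wchar_t* src) {
    std::u16string out;
    for (; *src != L'\0'; ++src) {
        const std::uint32_t cp = static_cast<std::uint32_t>(*src);
        if (sizeof(wchar_t) == 2) {
            // 16-bit wchar_t (Windows): assumed to be UTF-16 already, so copy.
            out.push_back(static_cast<char16_t>(cp));
        } else if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu)) {
            // Not a valid code point: emit U+FFFD REPLACEMENT CHARACTER.
            out.push_back(static_cast<char16_t>(0xFFFDu));
        } else if (cp < 0x10000u) {
            // BMP code point: a single 16-bit unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary code point: split into a surrogate pair.
            const std::uint32_t v = cp - 0x10000u;
            out.push_back(static_cast<char16_t>(0xD800u + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00u + (v & 0x3FFu)));
        }
    }
    return out;
}
```

As a worked example, U+1F600 gives v = 0xF600, so the high surrogate is 0xD800 + 0x3D = 0xD83D and the low surrogate is 0xDC00 + 0x200 = 0xDE00.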

Mihai Nita
2

The problem is that wchar_t is rather underspecified. You could use GNU libiconv to do what you want. It accepts the special encoding name "wchar_t" as both source and target encoding. That way it will be portable to Windows, Linux, and anywhere else you can provide libiconv.
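A rough sketch of what that could look like with the POSIX iconv interface. It assumes the iconv implementation recognizes the "WCHAR_T" and "UTF-16LE" encoding names (GNU libiconv and glibc do, as far as I know) and that `iconv` takes a `char**` input pointer; some implementations declare it as `const char**`. The function name `wchar_to_utf16le_iconv` is made up for the example.

```cpp
#include <iconv.h>

#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Sketch: wchar_t string -> UTF-16LE bytes (no byte order mark) via iconv.
inline std::string wchar_to_utf16le_iconv(const wchar_t* src) {
    iconv_t cd = iconv_open("UTF-16LE", "WCHAR_T");  // (to, from)
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    const std::size_t len = std::wcslen(src);
    std::size_t in_left = len * sizeof(wchar_t);
    // Each wchar_t yields at most 4 bytes of UTF-16 (one surrogate pair).
    std::string out(len * 4, '\0');
    std::size_t out_left = out.size();

    char* in_ptr = const_cast<char*>(reinterpret_cast<const char*>(src));
    char* out_ptr = &out[0];
    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - out_left);  // keep only the bytes written
    return out;
}
```

Pinning the target to "UTF-16LE" avoids the byte order mark that a plain "UTF-16" target may emit. When GNU libiconv is used as a separate library, the program also has to be linked against it.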

wilx
-1

The g++ compiler appears to support wcstombs?

JTeagle