
If I want to convert a string to UTF-16, say a `char *xmlbuffer`, do I have to convert it to `wchar_t *` before encoding to UTF-16? And is the `char *` type required before encoding to UTF-8?

How are `wchar_t` and `char` related to UTF-8, UTF-16, UTF-32, and the other transformation formats?

Thanks in advance for any help!

Hunter

3 Answers


No, you don't have to change data types.

About wchar_t: the standard says that

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales.

Unfortunately, it does not say what encoding wchar_t is supposed to have; this is implementation-dependent. So for example given

auto s = L"foo";

you can make absolutely no assumption about what the value of the expression *s is.

However, you can use an std::string as an opaque sequence of bytes that represent text in any transformation format of your choice without issue. Just don't perform standard library string-related operations on it.
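
For example, here is a minimal sketch (the byte values assume UTF-16LE) of why byte-oriented string functions give meaningless results on such a string:

#include <cstring>
#include <string>

int main() {
    // "hi" encoded as UTF-16LE (68 00 69 00), stored opaquely in a std::string.
    std::string utf16("h\0i\0", 4);

    std::size_t bytes = utf16.size();               // 4: the byte count is preserved
    std::size_t chars = std::strlen(utf16.c_str()); // 1: strlen stops at the first 0x00 byte
    (void)bytes; (void)chars;
    return 0;
}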

Jon
  • So can I say that using wchar_t for UTF-16 on the Windows platform is just a matter of convenience, and that in theory you could use char for UTF-16? – Hunter May 03 '12 at 22:20
  • @Hunter: In theory yes, but in Windows, `wchar_t` is used for UTF-16, and `char` for ASCII and UTF-8. – Mooing Duck May 03 '12 at 22:21
  • On Windows, `wchar_t` has a known size of 16 bits. – David Heffernan May 03 '12 at 22:23
  • @Hunter: Yes and yes (just don't e.g. call `strlen` on a `char*` that holds UTF-16). But if you do use `wchar_t` as UTF-16 portability goes out the window. – Jon May 03 '12 at 22:24
  • @MooingDuck So just because a string is stored in `wchar_t` doesn't mean it is UTF-16; it could be UTF-8, UTF-32, or UCS-2, right? – Hunter May 03 '12 at 22:25
  • @Jon It seems like `unsigned short` is a better choice for UTF-16 with regard to portability. – Hunter May 03 '12 at 22:26
  • @Hunter: Slightly, but `uint_least16_t` will serve even better – Mooing Duck May 03 '12 at 22:30
  • @Hunter: Usually `wchar_t` means UTF-16 on Windows and UTF-32 on Linux. `char` can be used to hold all the formats, but is usually UTF-8 or a codepage. Match whatever your framework uses (on Windows, that is `wchar_t` holding UTF-16). – Mooing Duck May 03 '12 at 22:30
  • @Jon If I call `strlen` on a `char*` that holds UTF-16, does it return the number of bytes? – Hunter May 03 '12 at 22:46
  • @Hunter, if you call `strlen` on a UTF-16 string it will probably always return 0 or 1. `strlen` only accepts 8-bit characters, and will stop at the first character that has an upper byte of 0. – Mark Ransom May 03 '12 at 22:53
  • @Hunter: No, because e.g. `L"hello"` would be `00 68 00 65 00 6c 00 6c 00 6f` in UTF-16BE, so `strlen` would return 0. – Jon May 03 '12 at 22:53
  • @Mooing Duck: `char16_t` is even better, but only recently added to the C++ standard. – dan04 May 03 '12 at 23:02
  • @dan04 Is `char16_t` always 16 bits on every platform? – Hunter May 03 '12 at 23:35
  • dan04: Good call, forgot about those! @Hunter: I don't think it's _guaranteed_ to be 16 bits, but it will be on virtually any modern machine. – Mooing Duck May 03 '12 at 23:41
  • According to [open-std.org](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html), `char16_t` has to have the same size and representation as `uint_least16_t`. – dan04 May 04 '12 at 12:50
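
A quick sketch of the char16_t approach from the comments above (assuming a C++11 compiler):

#include <climits>
#include <string>

// char16_t has the same size and representation as uint_least16_t,
// so it is guaranteed to hold at least 16 bits.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t is at least 16 bits");

// u"" literals are UTF-16 on every platform, unlike L"" whose encoding varies.
std::u16string s = u"foo";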

iconv is a POSIX function that can take care of the encoding conversion. You use iconv_open to specify that you have UTF-8 input and want UTF-16 output; then, using the handle returned from iconv_open, you call iconv, specifying your input buffer and output buffer. When you are done, you must call iconv_close on the handle returned from iconv_open to free its resources.

You will have to consult your system's documentation for the encodings iconv supports and their naming scheme (i.e. what to pass to iconv_open). For example, iconv on some systems expects "utf-8", while on others it may expect "UTF8".
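
A minimal sketch of that open/convert/close sequence, assuming a POSIX system and C++11 (the names "UTF-16LE" and "UTF-8" are examples; substitute whatever your iconv expects):

#include <iconv.h>
#include <stdexcept>
#include <string>

std::u16string utf8_to_utf16(const std::string& in) {
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");   // (to, from)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // UTF-16 never needs more code units than the UTF-8 input has bytes.
    std::u16string out(in.size(), u'\0');
    char* inbuf = const_cast<char*>(in.data());     // iconv's API is not const-correct
    size_t inleft = in.size();
    char* outbuf = reinterpret_cast<char*>(&out[0]);
    size_t outleft = out.size() * sizeof(char16_t);

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);                                // always release the handle
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    // Shrink to the number of code units actually written.
    out.resize(out.size() - outleft / sizeof(char16_t));
    return out;
}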

Windows does not provide a version of iconv; instead it provides its own UTF conversion functions: MultiByteToWideChar and WideCharToMultiByte.

#include <windows.h>
#include <string>

// UTF-8 to UTF-16
std::string input = ...
// First call with a NULL output buffer computes the required length
// in UTF-16 code units.
int utf16len = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), (int)input.size(),
                                   NULL, 0);
std::wstring output(utf16len, L'\0');
MultiByteToWideChar(CP_UTF8, 0, input.c_str(), (int)input.size(),
                    &output[0], (int)output.size());

// UTF-16 to UTF-8
std::wstring input = ...
int utf8len = WideCharToMultiByte(CP_UTF8, 0, input.c_str(), (int)input.size(),
                                  NULL, 0, NULL, NULL);
std::string output(utf8len, '\0');
WideCharToMultiByte(CP_UTF8, 0, input.c_str(), (int)input.size(),
                    &output[0], (int)output.size(), NULL, NULL);
dreamlax
  • Hunter: note that Windows does not come with iconv, but there are ways to get it. @Dreamlax: Do you mind if we insert my answer into yours as a Windows alternative and remove mine? The concept of using a library is the right one, and yours is clearer about that. – Mooing Duck May 03 '12 at 22:37
  • @MooingDuck: Yeah absolutely, sounds like a good idea. Put mine in yours or yours in mine, whichever. – dreamlax May 03 '12 at 22:40

The size of wchar_t is compiler-dependent, so its relation to the various Unicode transformation formats will vary.
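
For instance, a quick check you can compile yourself (the printed value varies by platform):

#include <cstdio>

int main() {
    // Typically 2 on Windows (UTF-16 code units) and 4 on Linux and macOS
    // (UTF-32 code units); the standard guarantees neither.
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}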

damienh