
I've been looking for a way to convert between the Unicode string types and came across this method. Not only do I not completely understand the method (there are no comments), but the article also implies that better methods will be available in the future.

If this is the best method, could you please point out what makes it work, and if not I would like to hear suggestions for better methods.

Hydranix
DrYap

3 Answers


mbstowcs() and wcstombs() don't necessarily convert to UTF-16 or UTF-32; they convert to wchar_t in whatever the locale's wchar_t encoding is. All Windows locales use a two-byte wchar_t with UTF-16 as the encoding, but the other major platforms use a four-byte wchar_t with UTF-32 (or even a non-Unicode encoding for some locales). A platform that only supports single-byte encodings could even have a one-byte wchar_t whose encoding differs by locale. So wchar_t seems to me to be a bad choice for portability and Unicode. *

Some better options have been introduced in C++11: new specializations of std::codecvt, new codecvt classes, and a new template that makes using them for conversions very convenient.

First, the new template class for using codecvt facets is std::wstring_convert. Once you've created an instance of a std::wstring_convert class, you can easily convert between strings:

std::wstring_convert<...> convert; // ... filled in with a codecvt to do UTF-8 <-> UTF-16
std::string utf8_string = u8"This string has UTF-8 content";
std::u16string utf16_string = convert.from_bytes(utf8_string);
std::string another_utf8_string = convert.to_bytes(utf16_string);

To do a different conversion you just need different template parameters, one of which is a codecvt facet. Here are some new facets that are easy to use with wstring_convert:

std::codecvt_utf8_utf16<char16_t> // converts between UTF-8 <-> UTF-16
std::codecvt_utf8<char32_t> // converts between UTF-8 <-> UTF-32
std::codecvt_utf8<char16_t> // converts between UTF-8 <-> UCS-2 (warning, not UTF-16! Don't bother using this one)

Examples of using these:

std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string a = convert.to_bytes(u"This string has UTF-16 content");
std::u16string b = convert.from_bytes(u8"blah blah blah");

The new std::codecvt specializations are a bit harder to use because they have a protected destructor. To get around that, you can either define a subclass with a public destructor or use the std::use_facet template function to get an existing codecvt instance. Another issue with these specializations is that you can't use them in Visual Studio 2010: template specialization doesn't work with typedef'd types there, and that compiler defines char16_t and char32_t as typedefs. Here's an example of defining your own subclass of codecvt:

template <class internT, class externT, class stateT>
struct codecvt : std::codecvt<internT,externT,stateT>
{ ~codecvt(){} };

std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>,char16_t> convert16;
std::wstring_convert<codecvt<char32_t,char,std::mbstate_t>,char32_t> convert32;

The char16_t specialization converts between UTF-16 and UTF-8. The char32_t specialization, UTF-32 and UTF-8.

Note that these new conversions provided by C++11 don't include any way to convert directly between UTF-32 and UTF-16. Instead you just have to combine two instances of std::wstring_convert.


* I thought I'd add a note on wchar_t and its purpose, to emphasize why it should not generally be used for Unicode or portable internationalized code. The following is a short version of my answer https://stackoverflow.com/a/11107667/365496

What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to wchar_t where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). -- [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales. Which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.

Since that seems to be the primary use in practice for wchar_t you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that there is a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms used with ASCII strings to work with other languages.

Unfortunately the requirements on wchar_t assume a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption, so you can't safely use wchar_t for simple text algorithms either.

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.

What use is wchar_t today?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely on it alone to decide that you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.

The reason Windows doesn't define __STDC_ISO_10646__, I think, is that Windows uses UTF-16 as its wchar_t encoding, and UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, so UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.

For platform specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').

In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes.

bames53
  • Thank you very much for such an in-depth response, this is exactly what I was looking for. Could I just confirm that UTF-16 to UTF-32 would require UTF-16 to UTF-8 and then to UTF-32? – DrYap Aug 30 '11 at 07:51
  • Yes, you have to go through UTF-8. – bames53 Aug 30 '11 at 16:03
  • Actually, there may be a way to go directly between UTF-16 and UTF-32, but I haven't used it so I'm not sure of all the details. Take a look at another C++11 facet: codecvt_utf16. – bames53 Aug 30 '11 at 16:23
  • I had a go at doing this but I had a problem with wstring_convert not existing. Does this not work with g++ yet? – DrYap Aug 31 '11 at 13:40
  • Unfortunately it appears that stdlibc++ hasn't gotten this far even in its latest version. I guess this is what the article you linked to was saying. So the thing that makes the code in that article work is that stdlibc++'s std::codecvt can use libiconv. Take a look at that guy's Converter.h and look at EncSt, codecvt_type, and how EncSt state is initialized. – bames53 Aug 31 '11 at 20:39
  • 2
    BTW, this stuff _is_ implmented in libc++ (not quite yet the standard c++ library for clang), as well as VS2010 (except for the exception I noted). – bames53 Aug 31 '11 at 21:07
  • @bames53: I tried this (http://ideone.com/7u3si) with gcc-4.7.0-beta from yesterday. If it is implemented, I must be doing something seriously wrong. Can you give me a hint? Both the ideone gcc-4.5 and my gcc-4.7.0-svn "can't find ". Also on my installation `$ include/c++/4.7.0$ find . -type f | xargs grep wstring_convert` reveals nothing. – towi Oct 10 '11 at 07:51
  • 4
    @towi looks like it's still not implemented in gcc. Only MSVC and libc++. – bames53 Oct 10 '11 at 14:46
  • wstring_convert and friends have been deprecated in C++17 – unexpectedvalue May 20 '17 at 07:18
  • They have, and for what, [in my opinion](https://www.reddit.com/r/cpp_questions/comments/5yo0el/stdwstring_convert_and_stdcodecvt_utf8_deprecated/dfbov4r/), is a bad reason. – bames53 May 20 '17 at 09:14

I've written helper functions to convert to/from UTF8 strings (C++11):

#include <string>
#include <locale>
#include <codecvt>

using namespace std;

template <typename T>
string toUTF8(const basic_string<T, char_traits<T>, allocator<T>>& source)
{
    // Note: codecvt_utf8_utf16 always converts between UTF-8 and UTF-16
    // code units, so T should be a 16-bit code unit type such as char16_t
    // (or wchar_t on Windows).
    wstring_convert<codecvt_utf8_utf16<T>, T> converter;
    return converter.to_bytes(source);
}

template <typename T>
void fromUTF8(const string& source, basic_string<T, char_traits<T>, allocator<T>>& result)
{
    wstring_convert<codecvt_utf8_utf16<T>, T> converter;
    result = converter.from_bytes(source);
}

Usage example:

// Unicode <-> UTF8
{
    wstring uStr = L"Unicode string";
    string str = toUTF8(uStr);

    wstring after;
    fromUTF8(str, after);
    assert(uStr == after);
}

// UTF16 <-> UTF8
{
    u16string uStr;
    uStr.push_back('A');
    string str = toUTF8(uStr);

    u16string after;
    fromUTF8(str, after);
    assert(uStr == after);
}
Dmytro

As far as I know, C++ provides no standard methods to convert from or to UTF-32. However, there are the methods mbstowcs (multi-byte to wide character string) and its inverse, wcstombs; note that these convert to and from the platform's wchar_t encoding, which is UTF-16 on Windows but not necessarily elsewhere.

If you need UTF-32 too, you need iconv, which is in POSIX 2001 but not in standard C, so on Windows you'll need a replacement like libiconv.

Here's an example on how to use mbstowcs:

#include <string>
#include <iostream>
#include <clocale>
#include <cstdlib>

using namespace std;

wstring widestring(const string &text);

int main()
{
  // mbstowcs/wcstombs use the current C locale; select the user's locale
  // instead of the default "C" locale so non-ASCII input converts.
  setlocale(LC_ALL, "");

  string text;
  cout << "Enter something: ";
  cin >> text;

  wcout << L"You entered " << widestring(text) << ".\n";
  return 0;
}

wstring widestring(const string &text)
{
  // A multi-byte string never has more characters than bytes, so
  // text.length() wide characters is always enough.
  wstring result(text.length(), L'\0');
  size_t length = mbstowcs(&result[0], text.c_str(), text.length());
  if (length == (size_t)-1)
    return wstring(); // invalid multi-byte sequence in the current locale
  result.resize(length); // trim to the number of characters actually written
  return result;
}

The reverse goes like this:

string mbstring(const wstring &text)
{
  // Ask wcstombs for the required size first: the multi-byte result can
  // need more bytes than the wide string has characters (e.g. UTF-8).
  size_t length = wcstombs(nullptr, text.c_str(), 0);
  if (length == (size_t)-1)
    return string(); // unconvertible character in the current locale
  string result(length, '\0');
  wcstombs(&result[0], text.c_str(), length);
  return result;
}

Nitpick: yes, I know the size of wchar_t is implementation-defined, so it could be 4 bytes (UTF-32). However, I don't know of a compiler that does that.

Raphael R.
  • 7
    GCC on Linux uses UTF-32 for `wchar_t`. – dan04 Aug 29 '11 at 17:12
  • 7
    So far as I know, Windows is the only common platform that uses UTF-16 for wstring. – Head Geek Sep 05 '11 at 20:13
  • 1
    Probably doesn't count as 'common', but I think AIX uses 2 byte wchar_t and UTF-16. – bames53 Oct 05 '11 at 19:03
  • The problem with the reverse function is that you might need a buffer with more elements than there are characters in the original string, e.g. if you converted a wide string with Japanese and it was converted to S-JIS, the text would be truncated. If you call `wcstombs` with NULL as the first argument, then the function will return the size of the buffer necessary to store all characters in the original string. Also, prior to C++11, there was no guarantee that elements in a `std::string` were stored contiguously, and from C++11, there is `std::codecvt` which makes this whole ordeal trivial. – dreamlax Nov 14 '13 at 03:22