Part of my question stems from my misunderstanding, or incomplete understanding, of how the string and wstring classes work in C++ (I come from a C# background).
The differences between the two, and their pros and cons, are described in this great answer: std::wstring VS std::string.
How string and wstring work
For me, the single most important discovery about the string and wstring classes was that semantically they do not represent a piece of encoded text, but simply a "string" of char or wchar_t elements. They are more like a plain data array with some string-specific operations (like append and substr) than a representation of text. Neither of them is aware of any kind of string encoding whatsoever; they handle each char or wchar_t element individually, as a separate character.
Encodings
However, on most systems, if you create a string from a string literal with a special character like this:
std::string s("ű");
The ű character will be represented by more than one byte in memory, but that has nothing to do with the std::string class; it is a feature of the compiler, which may encode string literals as UTF8 (not every compiler does, though). (And string literals prefixed with L are represented by wchar_t-s in UTF16, UTF32 or something else, depending on the compiler.)
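One way to see what your own compiler does is to dump the raw bytes of such a literal. A minimal sketch (the output depends on the compiler and the encoding of the source file):
#include <cstdio>

int main()
{
    const char s[] = "ű";
    // With a compiler that encodes narrow literals as UTF8,
    // this prints "c5 b1"; other setups may print different bytes.
    for(int i = 0; s[i] != '\0'; ++i)
        std::printf("%02x ", static_cast<unsigned>(static_cast<unsigned char>(s[i])));
    return 0;
}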
Thus the string "ű" will be represented in memory with two bytes: 0xC5 0xB1, and the std::string class won't know that those two bytes semantically mean one character (one Unicode code point) in UTF8, hence the sample code:
std::string s("ű");
std::cout << s.length() << std::endl;
std::cout << s.substr(0, 1);
produces the following result (depending on the compiler: some compilers do not treat string literals as UTF8, and some depend on the encoding of the source file):
2
�
The length() function (equivalently, size()) returns 2, because the only thing the std::string knows is that it stores two bytes (two chars). And substr works just as "primitively": it returns a string containing the single char 0xC5, which is displayed as �, because on its own it is not a valid UTF8 sequence (but that does not bother the std::string).
From that we can see that it is the various text-processing APIs of the platform, like the simple cout or DirectWrite, that handle encodings.
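For example, counting code points instead of chars has to be done on top of std::string. A minimal sketch, assuming the input is valid UTF8 (it simply skips continuation bytes, which have the form 10xxxxxx):
#include <string>
#include <cstddef>

// Counts Unicode code points in a UTF8-encoded std::string by skipping
// continuation bytes. Assumes well-formed UTF8; no error checking.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for(std::size_t i = 0; i < s.size(); ++i) {
        if((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
With this, utf8_length(std::string("ű")) yields 1, while length() yields 2.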
My approach
In my application DirectWrite is very important, and it only accepts strings encoded in UTF16 (in the form of wchar_t* pointers). So I decided to store the strings encoded in UTF16, both in memory and in files. However, I wanted a cross-platform implementation that can handle UTF16 strings on Windows, Android and iOS, which is not possible with std::wstring, because the size of wchar_t (and hence the encoding it is suited for) is platform-dependent.
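The platform dependence is easy to verify. A minimal sketch (the printed value is typically 2 on Windows and 4 on Linux, Android and iOS):
#include <cstdio>

int main()
{
    // Typically prints 2 on Windows (UTF16 code units) and 4 on
    // Linux/Android/iOS (UTF32 code points).
    std::printf("sizeof(wchar_t) = %u\n", static_cast<unsigned>(sizeof(wchar_t)));
    return 0;
}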
To create a cross-platform, strictly UTF16 string class, I instantiated basic_string with a character type that is 2 bytes wide. Quite surprisingly - at least for me - I found almost no information about this online; I based my solution on this approach. Here is the code:
#include <string>    // std::basic_string
#include <ios>       // std::streampos, std::streamoff
#include <cwchar>    // std::mbstate_t
#include <cstring>   // memcpy, memmove
#include <cstdio>    // EOF
#include <cstddef>   // size_t

// Define this on every platform to be 16 bits wide!
typedef unsigned short char16;

struct char16_traits
{
    typedef char16 char_type;
    typedef int int_type;
    typedef std::streampos pos_type;
    typedef std::streamoff off_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& dst, const char_type& src)
        { dst = src; }

    static bool eq(const char_type& x, const char_type& y)
        { return x == y; }

    static bool lt(const char_type& x, const char_type& y)
        { return x < y; }

    // Compare element by element; memcmp would compare the individual
    // bytes, whose order within a code unit depends on endianness.
    static int compare(const char_type* u, const char_type* v, size_t n)
    {
        for(size_t i = 0; i < n; ++i) {
            if(u[i] < v[i]) return -1;
            if(v[i] < u[i]) return 1;
        }
        return 0;
    }

    static size_t length(const char_type* u)
    {
        size_t count = 0;
        while(u[count] != 0) {
            count++;
        }
        return count;
    }

    static char_type* copy(char_type* u, const char_type* v, size_t n)
        { return (char_type*)memcpy(u, v, n * sizeof(char_type)); }

    static const char_type* find(const char_type* u, size_t n, const char_type& c)
    {
        for(size_t i = 0; i < n; ++i) {
            if(u[i] == c) {
                return &u[i];
            }
        }
        return 0;
    }

    static char_type* move(char_type* u, const char_type* v, size_t n)
        { return (char_type*)memmove(u, v, n * sizeof(char_type)); }

    static char_type* assign(char_type* u, size_t n, const char_type& c)
    {
        for(size_t i = 0; i < n; ++i) {
            assign(u[i], c);
        }
        return u;
    }

    static char_type to_char_type(const int_type& c)
        { return (char_type)c; }

    static int_type to_int_type(const char_type& c)
        { return (int_type)c; }

    static bool eq_int_type(const int_type& x, const int_type& y)
        { return x == y; }

    static int_type eof()
        { return EOF; }

    static int_type not_eof(const int_type& c)
        { return c != eof() ? c : !eof(); }
};

typedef std::basic_string<char16, char16_traits> utf16string;
Strings are stored with the above class, and the raw UTF16 data is passed to the specific API functions of the various platforms, all of which seem to support UTF16 encoding at the moment.
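For illustration, here is a hypothetical usage on Windows, where wchar_t happens to be 2 bytes wide, so the raw buffer can be reinterpreted for APIs taking const wchar_t* (such as DirectWrite). Note that the cast relies on that particular layout and is not portable C++:
utf16string title;
title.push_back(0x0171); // 'ű' as a single UTF16 code unit
title.push_back(0x0021); // '!'

// Valid only where wchar_t is 16 bits wide (e.g. Windows).
const wchar_t* raw = reinterpret_cast<const wchar_t*>(title.c_str());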
The implementation might not be perfect, but the append, substr and size functions seem to work properly. I still don't have much experience with string handling in C++, so feel free to comment/edit if I stated something incorrectly.
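One caveat worth adding: utf16string has the same "code units, not code points" semantics as std::string. Characters outside the Basic Multilingual Plane occupy two 16-bit code units (a surrogate pair), so size() counts them as 2. A minimal sketch of counting code points, assuming well-formed UTF16 (it skips trailing surrogates, which lie in the range 0xDC00-0xDFFF):
#include <cstddef>

// Counts Unicode code points in a well-formed UTF16 string by
// skipping trailing (low) surrogates.
std::size_t utf16_length(const utf16string& s)
{
    std::size_t count = 0;
    for(std::size_t i = 0; i < s.size(); ++i) {
        if(s[i] < 0xDC00 || s[i] > 0xDFFF)
            ++count;
    }
    return count;
}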