
I am developing an application whose core code base will be cross-platform, targeting Windows, iOS and Android.

My question is: how should I represent strings internally in this app so that I can use them effectively on all three platforms?

It is important to note that I use DirectWrite heavily on Windows, whose API functions usually expect a wchar_t* to be passed (by the way, the documentation describes such parameters as "A pointer to an array of Unicode characters."; I don't know whether this means they are in UTF-16 encoding or not).

I see three different approaches (however, I find it quite difficult to grasp the details of handling Unicode strings in C++ in a cross-platform manner, so maybe I am missing some important concept):

  • use std::string internally everywhere (storing the strings in UTF-8 encoding?), and convert them to wchar_t* where needed for the DirectWrite API (I don't know yet what is needed by the text-processing APIs of Android and iOS) – see the conversion sketch after this list.
  • use std::wstring internally everywhere. If I understand things right, this wouldn't be efficient from a memory-usage perspective, because a wchar_t is 4 bytes on iOS and Android (and does it mean that I would have to store the string in UTF-16 on Windows, and in UTF-32 on Android/iOS?)
  • create an abstraction for strings with an abstract base class, and implement the internal storage specifically for the different platforms.
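
For the first option, this is roughly the conversion I have in mind on Windows: a minimal sketch using the Win32 MultiByteToWideChar API (error handling omitted, names illustrative only):

#include <string>
#include <windows.h>

// Sketch: convert a UTF-8 std::string to a UTF-16 std::wstring so it can be
// handed to DirectWrite as a wchar_t* (Windows only).
std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();

    // First call asks for the required buffer size in wchar_t units.
    int size = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                   static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring wide(size, L'\0');

    // Second call performs the actual conversion into the prepared buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], size);
    return wide;
}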

What would be the best solution? And by the way, are there any existing cross-platform libraries that abstract string handling? (and also, reading and serializing of Unicode strings)

(UPDATE: deleted the part with the question about the difference of char* and std::string.)

Mark Vincze
  • '(or std::string? is there a difference)' erm, yes. If you are using C++ there's close to no good reason to represent text strings using char* :] http://stackoverflow.com/questions/801209/c-char-vs-stdstring – stijn Jul 16 '12 at 11:32

3 Answers


Part of my question stemmed from my misunderstanding, or incomplete understanding, of how the string and wstring classes work in C++ (I come from a C# background). The differences between the two, and their pros and cons, are described in this great answer: std::wstring VS std::string.

How string and wstring work

For me, the single most important discovery about the string and wstring classes was that semantically they do not represent a piece of encoded text, but simply a "string" of char or wchar_t elements. They are more like a plain data array with some string-specific operations (like append and substr) than a representation of text. Neither of them is aware of any string encoding whatsoever; they handle each char or wchar_t element individually, as a separate character.

Encodings

However, on most systems, if you create a string from a string literal with a special character like this:

std::string s("ű");

The ű character will be represented by more than one byte in memory, but that has nothing to do with the std::string class; it is a feature of the compiler, which can encode string literals as UTF-8 (though not every compiler does). (And string literals prefixed with L will be stored as wchar_t elements in UTF-16, UTF-32 or something else, depending on the compiler.)
Thus the string "ű" will be represented in memory by two bytes, 0xC5 0xB1, and the std::string class won't know that those two bytes semantically mean one character (one Unicode code point) in UTF-8, hence the sample code:

std::string s("ű");
std::cout << s.length() << std::endl;
std::cout << s.substr(0, 1);

produces the following result (depending on the compiler; some compilers do not treat string literals as UTF-8, and some depend on the encoding of the source file):

2
�

The length() call returns 2, because the only thing the std::string knows is that it stores two bytes (two chars). And substr works just as "primitively": it returns a string containing the single char 0xC5, which is displayed as �, because on its own it is not a valid UTF-8 sequence (but that does not bother the std::string).
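
To illustrate: if you do want a character count, the encoding has to be handled by your own code (or a library). A minimal sketch, assuming well-formed UTF-8, that counts code points by skipping continuation bytes:

#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded std::string by skipping
// continuation bytes (bytes of the form 10xxxxxx). Assumes valid UTF-8.
std::size_t Utf8CodePointCount(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) {  // not a continuation byte -> new code point
            ++count;
        }
    }
    return count;
}

// Utf8CodePointCount(std::string("ű")) returns 1, while length() returns 2.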

From this we can see that it is the various text-processing APIs of the platform, like the simple cout, or DirectWrite, that handle encodings.

My approach

In my application DirectWrite is very important, and it only accepts strings encoded in UTF-16 (in the form of wchar_t* pointers). So I decided to store the strings in UTF-16, both in memory and in files. However, I wanted a cross-platform implementation that can handle UTF-16 strings on Windows, Android and iOS, which is not possible with std::wstring, because the size of its elements (and thus the encoding it is suited for) is platform-dependent.

To create a cross-platform, strictly UTF-16 string class, I instantiated basic_string with a character type that is 2 bytes long. Quite surprisingly - at least for me - I found almost no information about this online; I based my solution on this approach. Here is the code:

#include <cstdio>    // EOF
#include <cstring>   // std::memcmp, std::memcpy, std::memmove
#include <cwchar>    // std::mbstate_t
#include <ios>       // std::streampos, std::streamoff
#include <string>    // std::basic_string

// Define this on every platform to be exactly 16 bits (2 bytes) wide!
typedef unsigned short char16;

struct char16_traits
{
    typedef char16 _E;
    typedef _E char_type;
    typedef int int_type;
    typedef std::streampos pos_type;
    typedef std::streamoff off_type;
    typedef std::mbstate_t state_type;
    static void assign(_E& _X, const _E& _Y)
    {_X = _Y; }
    static bool eq(const _E& _X, const _E& _Y)
    {return (_X == _Y); }
    static bool lt(const _E& _X, const _E& _Y)
    {return (_X < _Y); }
    static int compare(const _E *_U, const _E *_V, size_t _N)
    {return (std::memcmp(_U, _V, _N * sizeof(_E))); }
    static size_t length(const _E *_U)
    {
        size_t count = 0;
        while(_U[count] != 0)
        {
            count++;
        }
        return count;
    }
    static _E * copy(_E *_U, const _E *_V, size_t _N)
    {return ((_E *)std::memcpy(_U, _V, _N * sizeof(_E))); }
    static const _E * find(const _E *_U, size_t _N, const _E& _C)
    {
        for(size_t i = 0; i < _N; ++i) {
            if(_U[i] == _C) {
                return &_U[i];
            }
        }
        return 0;
    }
    static _E * move(_E *_U, const _E *_V, size_t _N)
    {return ((_E *)std::memmove(_U, _V, _N * sizeof(_E))); }
    static _E * assign(_E *_U, size_t _N, const _E& _C)
    {
        for(size_t i = 0; i < _N; ++i) {
            assign(_U[i], _C);
        }
        return _U;
    }
    static _E to_char_type(const int_type& _C)
    {return ((_E)_C); }
    static int_type to_int_type(const _E& _C)
    {return ((int_type)(_C)); }
    static bool eq_int_type(const int_type& _X, const int_type& _Y)
    {return (_X == _Y); }
    static int_type eof()
    {return (EOF); }
    static int_type not_eof(const int_type& _C)
    {return (_C != eof() ? _C : !eof()); }
};

typedef std::basic_string<char16, char16_traits> utf16string;

Strings are stored with the above class, and the raw UTF-16 data is passed to the platform-specific API functions, all of which at the moment seem to support UTF-16 encoding.
The implementation might not be perfect, but the append, substr and size functions seem to work properly. I still don't have much experience with string handling in C++, so feel free to comment/edit if I stated something incorrectly.
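
For illustration, here is a minimal sketch of how the raw data can be handed to DirectWrite on Windows; the factory and text format objects are assumed to exist already, and the layout box size is arbitrary:

#ifdef _WIN32
#include <dwrite.h>

// On Windows wchar_t is 16 bits, so the utf16string buffer can be
// reinterpreted as WCHAR* and passed straight to DirectWrite.
HRESULT CreateLayout(IDWriteFactory* factory, IDWriteTextFormat* format,
                     const utf16string& text, IDWriteTextLayout** layout)
{
    return factory->CreateTextLayout(
        reinterpret_cast<const WCHAR*>(text.c_str()),
        static_cast<UINT32>(text.length()),
        format,
        1024.0f, 768.0f,   // max width/height of the layout box, arbitrary here
        layout);
}
#endif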

Mark Vincze

The difference between std::string and char* is that the std::string class uses C++ features and char* does not. A std::string is a container class of chars that defines convenient methods for working with them; a char* is just a pointer to some memory that you may work with.
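
A tiny illustration of that difference (nothing more than standard library usage):

#include <string>

std::string s = "hello";
s += " world";                 // convenient append, memory managed for you
std::size_t n = s.size();      // length is tracked by the class
const char* p = s.c_str();     // a raw char* view into the same data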

If you are looking for a base class that is platform independent, I would point you to QString. It is part of the Qt library, which aims to provide platform-independent implementations in C++. It is also open source, so you can use it to get an idea of how others implement platform-independent strings. The documentation is also very good.
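
A rough sketch of the kind of conversions QString offers (check the documentation for the exact calls available in your Qt version):

#include <QString>
#include <string>

// QString stores text as UTF-16 internally and converts to and from
// other encodings and standard types.
void Example()
{
    QString q = QString::fromUtf8("ű");          // build from UTF-8 bytes
    std::string utf8 = q.toUtf8().constData();   // back to UTF-8 bytes
    const ushort* utf16 = q.utf16();             // raw UTF-16 data, e.g. for platform APIs
    (void)utf8;
    (void)utf16;
}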

HaMster
  • Thanks, I am gonna look into QString. – Mark Vincze Jul 16 '12 at 11:41
  • It also comes with built-in internationalization support. If you don't know the framework yet I would recommend giving it a try. The support for mobile development is growing constantly. For a short introduction [read this](http://qt.nokia.com/products/library) – HaMster Jul 16 '12 at 12:43

Implementing an abstract class with a different representation on each platform seems like a bad idea. It means extra work implementing and testing (on each platform) and it will add more overhead than just using std::wstring (of course you could counter the overhead by switching the implementation with #ifdefs instead of an abstract class, but that is still extra work).

Using either std::string or std::wstring everywhere seems the way to go: implement some utility functions to convert the string type you choose to the system-dependent format, and you won't have a problem. I am working on a multi-platform project which already runs on iOS, Windows, Linux and Mac; in this project I used multibyte std::string and didn't have many problems. I never used std::wstring, but I don't see why it wouldn't work.
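
For example, a minimal sketch of such a conversion utility, assuming the C++11 <codecvt> facilities (deprecated since C++17; this illustrates the idea, not necessarily what was used in the project mentioned above):

#include <codecvt>
#include <locale>
#include <string>

// Convert a UTF-8 std::string to UTF-16 (std::u16string) and back.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

std::string Utf16ToUtf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}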

fbafelipe
  • That means that you always stored the strings in a specific encoding (UTF-8?), and you converted them on the fly to another format or encoding when the API you wanted to use needed it? Or did all the APIs you used require the same encoding? – Mark Vincze Jul 16 '12 at 12:17
  • @Mark I always stored the strings in UTF8, and converted from UTF8 to whatever the system needed. – fbafelipe Jul 16 '12 at 12:31