Why not wchar_t and wstring? Yes, it's 4 bytes on some platforms and 2 bytes on others; still, it has the advantage of having a whole body of string-processing RTL routines built around it. Cocoa's NSString/CFString is 2 bytes per character (like wchar_t on Windows), but it's extremely unportable.
You'd have to be careful around persistence and wire formats - make sure they don't depend on the size of wchar_t.
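As a minimal sketch of that, assuming you serialize wide strings yourself (the to_wire/from_wire helper names below are made up), you can pin the wire layout to fixed-width integers so it never varies with sizeof(wchar_t):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helpers: store each code unit as a fixed 32-bit integer, so
// readers and writers agree on the layout whether wchar_t is 2 or 4 bytes.
// Note this only fixes the width; a real implementation would also transcode
// properly (surrogate pairs on 2-byte platforms) and pick an endianness.
std::vector<std::uint32_t> to_wire(const std::wstring& s)
{
    std::vector<std::uint32_t> out;
    out.reserve(s.size());
    for (wchar_t c : s)
        out.push_back(static_cast<std::uint32_t>(c)); // widen to fixed 4 bytes
    return out;
}

std::wstring from_wire(const std::vector<std::uint32_t>& v)
{
    std::wstring out;
    out.reserve(v.size());
    for (std::uint32_t c : v)
        out.push_back(static_cast<wchar_t>(c)); // narrows back to the local wchar_t
    return out;
}
```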
It really depends on what your optimization priority is. If you do intense string processing (parsing, etc.), go with wchar_t. If you'd rather interact smoothly with the host system, opt for whatever format matches the host OS's assumptions.
Redefining wchar_t to be two bytes is an option, too; it's -fshort-wchar with GCC. You'll lose the whole body of wcs* RTL and a good portion of the STL, but there will be less codepage translation when interacting with the host system. It so happens that both big-name mobile platforms out there (one fruit-themed, one robot-themed) have two-byte strings as their native format, but a 4-byte wchar_t by default. -fshort-wchar works on both; I've tried.
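If you go that route, a compile-time guard is a cheap way to make sure every translation unit actually got the flag, since mixing 2-byte and 4-byte wchar_t objects in one binary is an ABI hazard. A rough sketch, assuming GCC/Clang and C++11:

```cpp
// Build with: g++ -fshort-wchar example.cpp
#include <cstddef>

static_assert(sizeof(wchar_t) == 2, "build this target with -fshort-wchar");

int main()
{
    const wchar_t* s = L"hello";
    // Careful: with -fshort-wchar, the C library's wcs* routines still assume
    // the platform's default 4-byte wchar_t, so calling wcslen() here would be
    // a mismatch. Measure the length by hand instead.
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return static_cast<int>(n);
}
```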
Here's a handy summary of desktop and mobile platforms:
- Windows, Windows Phone, Windows RT, Windows CE: wchar_t is 2 bytes, OS uses UTF-16
- Vanilla desktop Linux: wchar_t is 4 bytes, OS uses UTF-8, various frameworks may use who knows what (Qt, notably, uses UTF-16)
- Mac OS X, iOS: wchar_t is 4 bytes, OS uses UTF-16, userland comes with an alternative 2-byte-based string RTL
- Android: wchar_t is 4 bytes, OS uses UTF-8, but the layer of interaction with Java uses UTF-16
- Samsung bada: wchar_t is 2 bytes, the userland API uses UTF-16, POSIX layer is severely crippled anyway so who cares
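One way to act on that summary is to pick the "native" code unit per platform at compile time. A rough sketch, assuming the usual predefined macros and C++11 (the alias names are made up):

```cpp
#include <string>

#if defined(_WIN32)
    // Windows family: the OS API speaks UTF-16 and wchar_t is already 2 bytes.
    using native_char   = wchar_t;
    using native_string = std::wstring;
#elif defined(__APPLE__)
    // Mac OS X / iOS: NSString/CFString are UTF-16, even though wchar_t is 4 bytes,
    // so a 2-byte code unit matches the host framework better.
    using native_char   = char16_t;
    using native_string = std::u16string;
#else
    // Desktop Linux, Android at the OS level: UTF-8 in plain char strings.
    using native_char   = char;
    using native_string = std::string;
#endif
```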