I want to implement some string processing for Japanese in C++ (my system is OS X). That turns out to be much harder than it sounds. I've read a lot, but I still have problems with basic things.

I want my code to compile and run on other machines too. From what I've read so far, that seems to rule out the wchar_t data type.

  • In which data type should I represent my Japanese characters then?
  • If I use char, I get an error that the data doesn't fit into the char data type. What other data type should I use then?
  • Is there any acceptable way of processing wide-char languages with standard C++ without pitfalls, or will I always create a system-dependent piece of code if I stick to standard C++?
  • Let's start with the encoding. Do you want UTF-8, UTF-16 or UTF-32? – deviantfan Mar 26 '14 at 19:27
  • Addition: http://www.cplusplus.com/reference/cwchar/wchar_t/ wchar_t is there in C++, but that alone won't help you – deviantfan Mar 26 '14 at 19:28
  • possible duplicate of [Cross-platform strings (and Unicode) in C++](http://stackoverflow.com/questions/4169948/cross-platform-strings-and-unicode-in-c) – Claudiordgz Mar 26 '14 at 19:30
  • Could be [Shift-JIS](http://en.wikipedia.org/wiki/Shift_JIS) as well. – tadman Mar 26 '14 at 19:35
  • I want to use UTF-16 and have the possibility to address a single character. If I use a std::string, a single Japanese character has length 3, which makes it really complicated to work with. – user1091534 Mar 26 '14 at 19:42
  • @user1091534 If you stick to just normal Japanese characters, UTF-16 will always use a single code unit, but things like emoji or a Japanese version of zalgo text will use more than a single UTF-16 code unit per character. Whatever it is you're doing, solving those problems will also solve any issues with UTF-8's three bytes per normal Japanese character. If you rely on UTF-16 acting like a 'fixed width' encoding, then your code will probably be broken (see the sketch after these comments). – bames53 Mar 26 '14 at 19:50
  • Forget about std::string. Doing any sensible processing with a variable-length encoding is a huge pain. Japanese is all in the Basic Multilingual Plane, so two-byte characters are just right. – Seva Alekseyev Mar 26 '14 at 19:50
  • @SevaAlekseyev Unfortunately Unicode is fundamentally variable width. The only ways to robustly handle characters encoded in Unicode as fixed width are quite expensive. E.g.: using a dynamic allocation for possibly every single character. – bames53 Mar 26 '14 at 19:55
  • UTF-32 isn't :) Compound characters aren't a concern in Japanese. Let's have a glyph vs. codepoint flame war; those are fun. – Seva Alekseyev Mar 26 '14 at 19:56
  • I thought that UTF-8 was pretty much standard. – Nicolas Louis Guillemot Mar 26 '14 at 20:04
  • For wire formats - indeed it is. For string processing, it's - see above - a pain. – Seva Alekseyev Mar 26 '14 at 20:05
  • @SevaAlekseyev Yep, code points aren't characters. I can't process Japanese zalgo text (e.g., キ͙̜͉̝̱͙̲̇ͦテ̛̻̤̣͔̘̺̜͆̅̏͋̚ィ̰͍͚̪͎̂̋ͦ・͓̤͕͔̘̀ホ̖̘̹̟̅͌ͫ̌̓ワ͚͈̭̲̓͗ͩ̈́イ̗̼͌̈͆ͤ̈́ト) as fixed width even using UTF-32. – bames53 Mar 26 '14 at 20:11
  • That isn't legal Japanese :) – Seva Alekseyev Mar 26 '14 at 20:21
  • That doesn't mean users won't use it or that whatever failures occur in a program that pretends UTF-32 can be treated as fixed width characters are acceptable. – bames53 Mar 27 '14 at 00:49
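
To make the fixed-width point above concrete, here's a minimal sketch (C++11; assumes a UTF-8 source file and clang or gcc) counting code units for one user-perceived character, katakana KI with a combining acute accent:

#include <iostream>
#include <string>

int main() {
  // One user-perceived character: katakana KI (U+30AD) plus a
  // combining acute accent (U+0301).
  std::u32string c32 = U"\u30AD\u0301";
  std::u16string c16 = u"\u30AD\u0301";
  std::string    c8  = u8"\u30AD\u0301";

  std::cout << c32.size() << '\n'; // 2: two codepoints even in UTF-32
  std::cout << c16.size() << '\n'; // 2: two UTF-16 code units
  std::cout << c8.size()  << '\n'; // 5: five UTF-8 bytes (3 + 2)
}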

2 Answers

Why not wchar_t and wstring? Yes, it's 4 bytes on some platforms and 2 bytes on others; still, it has the advantage of having a bunch of string processing RTL routines built around it. Cocoa's NSString/CFString is 2 bytes per character (like wchar_t on Windows), but it's extremely unportable.

You'd have to be careful around persistence and wire formats - make sure they don't depend on the size of wchar_t.
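
A sketch of that boundary using C++11's <codecvt> (deprecated in C++17, but fine for illustration; the helper names here are made up): convert to UTF-8 when serializing, so files never depend on the in-memory wchar_t width.

#include <codecvt>
#include <locale>
#include <string>

// Serialize as UTF-8 so the on-disk format doesn't depend on whether
// wchar_t is 2 or 4 bytes. On platforms with 2-byte wchar_t, consider
// codecvt_utf8_utf16 instead so surrogate pairs survive the round trip.
std::string to_utf8(const std::wstring& w) {
  std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
  return conv.to_bytes(w);
}

std::wstring from_utf8(const std::string& s) {
  std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
  return conv.from_bytes(s);
}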

It really depends on what your optimization priority is. If you have intense processing (parsing, etc.), go with wchar_t. If you'd rather interact smoothly with the host system, opt for whatever format matches the assumptions of the host OS.

Redefining wchar_t to be two bytes is an option, too; with GCC, that's -fshort-wchar. You'll lose the whole body of wcs* RTL and a good portion of the STL, but there will be less codepage translation when interacting with the host system. It so happens that both big-name mobile platforms out there (one fruit-themed, one robot-themed) have two-byte strings as their native format, but a 4-byte wchar_t by default. -fshort-wchar works on both, I've tried.
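
If you go that route, it's worth pinning the assumption down at compile time, since mixing -fshort-wchar code with libraries built without it breaks the ABI. A one-line C++11 check:

// Fails the build if a translation unit is compiled without -fshort-wchar.
static_assert(sizeof(wchar_t) == 2, "compile with -fshort-wchar");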

Here's a handy summary of desktop and mobile platforms:

  • Windows, Windows Phone, Windows RT, Windows CE: wchar_t is 2 bytes, OS uses UTF-16
  • Vanilla desktop Linux: wchar_t is 4 bytes, OS uses UTF-8, various frameworks may use who knows what (Qt, notably, uses UTF-16)
  • Mac OS X, iOS: wchar_t is 4 bytes, OS uses UTF-16, userland comes with an alternative 2-byte-based string RTL
  • Android: wchar_t is 4 bytes, OS uses UTF-8, but the layer of interaction with Java uses UTF-16
  • Samsung bada: wchar_t is 2 bytes, the userland API uses UTF-16, POSIX layer is severely crippled anyway so who cares
Seva Alekseyev

  • In which data type should I represent my Japanese characters then?

The representation you should use depends on what you want to do. There's char32_t, which can hold entire codepoints, but that doesn't necessarily solve your issues.
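
For instance, here's a sketch (again C++11's <codecvt>, deprecated in C++17) that decodes UTF-8 into one char32_t per codepoint, so that indexing addresses whole codepoints (though, per the comments above, a codepoint still isn't always a full user-perceived character):

#include <codecvt>
#include <cstdint>
#include <iostream>
#include <locale>
#include <string>

int main() {
  std::string utf8 = u8"キティ・ホワイト";

  // Decode UTF-8 into one char32_t per codepoint.
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
  std::u32string cps = conv.from_bytes(utf8);

  std::cout << cps.size() << " codepoints\n"; // 8, vs. 24 UTF-8 bytes
  // cps[0] is U+30AD (katakana KI) as a single unit.
  std::cout << std::hex << static_cast<std::uint32_t>(cps[0]) << '\n';
}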

  • If I use char, I get an error that the data doesn't fit into the char data type. What other data type should I use then?

You absolutely can store Japanese text in char using the right encoding. For example, UTF-8 is very common and is the default on OS X. The following code works on OS X with clang and on Linux with gcc. It also works on Windows if the output is redirected to a text file (and with a bit of trickery to wring a UTF-8 string literal out of VC++).

#include <iostream>

int main() {
  // Assumes the source file is saved as UTF-8, so the literal's bytes
  // are UTF-8 and display correctly on a UTF-8 terminal.
  std::cout << "キティ・ホワイト\n";
}

Other possibilities are 16-bit integral types (UTF-16 and UCS-2 encodings), 32-bit integral types (UCS-4/UTF-32), or a custom type for holding complete 'characters' in your system (using either dynamic allocation, a limit on combining codepoints, or some other scheme).
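
As a hypothetical shape for that last option (these names are invented purely for illustration), a 'character' might bundle a base codepoint with its combining marks:

#include <string>
#include <vector>

// Hypothetical: one user-perceived character as a base codepoint plus
// any combining codepoints attached to it.
struct Character {
  char32_t base;
  std::u32string combining; // usually empty for plain Japanese text
};

using Text = std::vector<Character>; // indexing yields whole characters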

  • Is there any acceptable way of processing wide-char languages with standard C++ without pitfalls, or will I always create a system-dependent piece of code if I stick to standard C++?

Whatever this unspecified 'processing' is, if it can be done anywhere then there's a way to do it in standard, portable C++. Depending on what you need, you might want to use a library like ICU, and your choice of library may influence what representation you use for the text. ICU, for example, adapts to different encodings, but I believe it is natively UTF-16.
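
For example, assuming ICU is installed (link with -licuuc), its UnicodeString stores UTF-16 internally but converts from and to UTF-8 at the edges:

#include <iostream>
#include <string>
#include <unicode/unistr.h>

int main() {
  icu::UnicodeString s = icu::UnicodeString::fromUTF8("キティ・ホワイト");

  std::cout << s.length() << " UTF-16 code units\n"; // 8
  std::cout << s.countChar32() << " codepoints\n";   // 8

  std::string back;
  s.toUTF8String(back); // back to UTF-8 for output or storage
}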

bames53