I want to implement some string processing for Japanese in C++ (my system is OS X). That turns out to be much harder than it sounds. I've read a lot, but I still have problems with basic things.

I want my code to compile and run on other machines too. From what I've read so far, that seems to rule out the wchar_t data type.

  • In which data type should I represent my Japanese characters then?
  • If I use char, I get an error that the data doesn't fit into the char data type. What other data type should I use then?
  • Is there any acceptable way of processing wide-char languages with standard C++ without pitfalls, or will I always create a system-dependent piece of code if I stick to standard C++?
  • Let's start with the encoding. Do you want UTF-8, UTF-16 or UTF-32? – deviantfan Mar 26 '14 at 19:27
  • Addition: http://www.cplusplus.com/reference/cwchar/wchar_t/ wchar_t is there in C++, but that alone won't help you – deviantfan Mar 26 '14 at 19:28
  • possible duplicate of [Cross-platform strings (and Unicode) in C++](http://stackoverflow.com/questions/4169948/cross-platform-strings-and-unicode-in-c) – Claudiordgz Mar 26 '14 at 19:30
  • Could be [Shift-JIS](http://en.wikipedia.org/wiki/Shift_JIS) as well. – tadman Mar 26 '14 at 19:35
  • I want to use UTF-16 and have the possibility to address a single character. If I use a std::string, a single Japanese character has length 3, which makes it really complicated to work with. – user1091534 Mar 26 '14 at 19:42
  • @user1091534 If you stick to just normal Japanese characters, UTF-16 will always use a single code unit, but things like emoji or a Japanese version of zalgo text will use more than a single UTF-16 code unit per character. Whatever it is you're doing, solving those problems will also solve any issues with UTF-8's three bytes per normal Japanese character. If you rely on UTF-16 acting like a 'fixed width' encoding, then your code will probably be broken (see the sketch after these comments). – bames53 Mar 26 '14 at 19:50
  • Forget about std::string. Doing any sensible processing with a variable-length encoding is a huge pain. Japanese is all in the Basic Multilingual Plane, so two-byte characters are just right. – Seva Alekseyev Mar 26 '14 at 19:50
  • @SevaAlekseyev Unfortunately Unicode is fundamentally variable width. The only ways to robustly handle characters encoded in Unicode as fixed width are quite expensive. E.g.: using a dynamic allocation for possibly every single character. – bames53 Mar 26 '14 at 19:55
  • UTF-32 isn't :) Compound characters aren't a concern in Japanese. Let's have a glyph vs. codepoint flame war; those are fun. – Seva Alekseyev Mar 26 '14 at 19:56
  • I thought that UTF-8 was pretty much standard. – Nicolas Louis Guillemot Mar 26 '14 at 20:04
  • For wire formats - indeed it is. For string processing, it's - see above - a pain. – Seva Alekseyev Mar 26 '14 at 20:05
  • @SevaAlekseyev Yep, code points aren't characters. I can't process Japanese zalgo text (e.g., キ͙̜͉̝̱͙̲̇ͦテ̛̻̤̣͔̘̺̜͆̅̏͋̚ィ̰͍͚̪͎̂̋ͦ・͓̤͕͔̘̀ホ̖̘̹̟̅͌ͫ̌̓ワ͚͈̭̲̓͗ͩ̈́イ̗̼͌̈͆ͤ̈́ト) as fixed width even using UTF-32. – bames53 Mar 26 '14 at 20:11
  • That isn't legal Japanese :) – Seva Alekseyev Mar 26 '14 at 20:21
  • That doesn't mean users won't use it or that whatever failures occur in a program that pretends UTF-32 can be treated as fixed width characters are acceptable. – bames53 Mar 27 '14 at 00:49
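
To make the fixed-width point above concrete, here's a minimal sketch (C++11; assumes a UTF-8 source file and clang or gcc) counting code units for one user-perceived character, katakana KI with a combining acute accent:

#include <iostream>
#include <string>

int main() {
  // One user-perceived character: katakana KI (U+30AD) plus a
  // combining acute accent (U+0301).
  std::u32string c32 = U"\u30AD\u0301";
  std::u16string c16 = u"\u30AD\u0301";
  std::string    c8  = u8"\u30AD\u0301";

  std::cout << c32.size() << '\n'; // 2: two codepoints even in UTF-32
  std::cout << c16.size() << '\n'; // 2: two UTF-16 code units
  std::cout << c8.size()  << '\n'; // 5: five UTF-8 bytes (3 + 2)
}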

2 Answers

Why not wchar_t and wstring? Yes, it's 4 bytes on some platforms and 2 bytes on others; still, it has the advantage of having a bunch of string processing RTL routines built around it. Cocoa's NSString/CFString is 2 bytes per character (like wchar_t on Windows), but it's extremely unportable.

You'd have to be careful around persistence and wire formats - make sure they don't depend on the size of wchar_t.
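
A sketch of that boundary using C++11's <codecvt> (deprecated in C++17, but fine for illustration; the helper names here are made up): convert to UTF-8 when serializing, so files never depend on the in-memory wchar_t width.

#include <codecvt>
#include <locale>
#include <string>

// Serialize as UTF-8 so the on-disk format doesn't depend on whether
// wchar_t is 2 or 4 bytes. On platforms with 2-byte wchar_t, consider
// codecvt_utf8_utf16 instead so surrogate pairs survive the round trip.
std::string to_utf8(const std::wstring& w) {
  std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
  return conv.to_bytes(w);
}

std::wstring from_utf8(const std::string& s) {
  std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
  return conv.from_bytes(s);
}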

It really depends on what your optimization priority is. If you have intense processing (parsing, etc.), go with wchar_t. If you'd rather interact smoothly with the host system, opt for whatever format matches the assumptions of the host OS.

Redefining wchar_t to be two bytes is an option, too; with GCC, that's -fshort-wchar. You'll lose the whole body of wcs* RTL and a good portion of the STL, but there will be less codepage translation when interacting with the host system. It so happens that both big-name mobile platforms out there (one fruit-themed, one robot-themed) have two-byte strings as their native format, but a 4-byte wchar_t by default. -fshort-wchar works on both, I've tried.
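
If you go that route, it's worth pinning the assumption down at compile time, since mixing -fshort-wchar code with libraries built without it breaks the ABI. A one-line C++11 check:

// Fails the build if a translation unit is compiled without -fshort-wchar.
static_assert(sizeof(wchar_t) == 2, "compile with -fshort-wchar");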

Here's a handy summary of desktop and mobile platforms:

  • Windows, Windows Phone, Windows RT, Windows CE: wchar_t is 2 bytes, OS uses UTF-16
  • Vanilla desktop Linux: wchar_t is 4 bytes, OS uses UTF-8, various frameworks may use who knows what (Qt, notably, uses UTF-16)
  • Mac OS X, iOS: wchar_t is 4 bytes, OS uses UTF-16, userland comes with an alternative 2-byte-based string RTL
  • Android: wchar_t is 4 bytes, OS uses UTF-8, but the layer of interaction with Java uses UTF-16
  • Samsung bada: wchar_t is 2 bytes, the userland API uses UTF-16, POSIX layer is severely crippled anyway so who cares
Seva Alekseyev

  • In which data type should I represent my Japanese characters then?

The representation you should use depends on what you want to do. There's char32_t, which can hold entire codepoints, but that doesn't necessarily solve your issues.
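
For instance, here's a sketch (again C++11's <codecvt>, deprecated in C++17) that decodes UTF-8 into one char32_t per codepoint, so that indexing addresses whole codepoints (though, per the comments above, a codepoint still isn't always a full user-perceived character):

#include <codecvt>
#include <cstdint>
#include <iostream>
#include <locale>
#include <string>

int main() {
  std::string utf8 = u8"キティ・ホワイト";

  // Decode UTF-8 into one char32_t per codepoint.
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
  std::u32string cps = conv.from_bytes(utf8);

  std::cout << cps.size() << " codepoints\n"; // 8, vs. 24 UTF-8 bytes
  // cps[0] is U+30AD (katakana KI) as a single unit.
  std::cout << std::hex << static_cast<std::uint32_t>(cps[0]) << '\n';
}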

  • If I use char, I get an error that the data doesn't fit into the char data type. What other data type should I use then?

You absolutely can store Japanese text in char using the right encoding. For example, UTF-8 is very common and is the default on OS X. The following code works on OS X with clang and on Linux with gcc. It also works on Windows if the output is redirected to a text file (and with a bit of trickery to wring a UTF-8 string literal out of VC++).

#include <iostream>

int main() {
  // Assumes the source file is saved as UTF-8, so the literal's bytes
  // are UTF-8 and display correctly on a UTF-8 terminal.
  std::cout << "キティ・ホワイト\n";
}

Other possibilities are 16-bit integral types (UTF-16 and UCS-2 encodings), 32-bit integral types (UCS-4/UTF-32), or a custom type for holding complete 'characters' in your system (using either dynamic allocation, a limit on combining codepoints, or some other scheme).
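
As a hypothetical shape for that last option (these names are invented purely for illustration), a 'character' might bundle a base codepoint with its combining marks:

#include <string>
#include <vector>

// Hypothetical: one user-perceived character as a base codepoint plus
// any combining codepoints attached to it.
struct Character {
  char32_t base;
  std::u32string combining; // usually empty for plain Japanese text
};

using Text = std::vector<Character>; // indexing yields whole characters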

  • Is there any acceptable way of processing wide-char languages with standard C++ without pitfalls, or will I always create a system-dependent piece of code if I stick to standard C++?

Whatever this unspecified 'processing' is, if it can be done anywhere then there's a way to do it in standard, portable C++. Depending on what you need, you might want to use a library like ICU, and your choice of library may influence what representation you use for the text. ICU, for example, adapts to different encodings, but I believe it is natively UTF-16.
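
For example, assuming ICU is installed (link with -licuuc), its UnicodeString stores UTF-16 internally but converts from and to UTF-8 at the edges:

#include <iostream>
#include <string>
#include <unicode/unistr.h>

int main() {
  icu::UnicodeString s = icu::UnicodeString::fromUTF8("キティ・ホワイト");

  std::cout << s.length() << " UTF-16 code units\n"; // 8
  std::cout << s.countChar32() << " codepoints\n";   // 8

  std::string back;
  s.toUTF8String(back); // back to UTF-8 for output or storage
}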

bames53