Actually, you shouldn't use std::string here, because std::string is byte-oriented and a Japanese character cannot be represented as a single byte. You should use std::wstring (or, in C++11, std::u16string and std::u32string for UTF-16 and UTF-32).
Consider the following C++11 code:
#include <string>
#include <iostream>
#include <iomanip>

using namespace std;

int main() {
    wstring s = L"Привет , 名前 hlong";
    for (wchar_t c : s)
        cout << "Char code = 0x" << hex << int(c) << endl;
    return 0;
}
It is compiled with GCC 4.7 as follows:

g++ -finput-charset=utf-8 -std=c++11 test_wstring.cc -o test_wstring

and produces the following output (0x20 stands for the space character):
Char code = 0x41f
Char code = 0x440
Char code = 0x438
Char code = 0x432
Char code = 0x435
Char code = 0x442
Char code = 0x20
Char code = 0x2c
Char code = 0x20
Char code = 0x540d
Char code = 0x524d
Char code = 0x20
Char code = 0x68
Char code = 0x6c
Char code = 0x6f
Char code = 0x6e
Char code = 0x67
As you can see, standard ASCII characters fall in the range 0x00-0x7F, Cyrillic characters start at 0x400, and the Japanese ones here are 0x540d and 0x524d. You should check the Unicode tables mentioned in the comments to see which ranges you are interested in; a rough range check along those lines is sketched below. You may also consider the std::codecvt facilities & Co. to convert between byte-oriented and character-oriented encodings (see http://en.cppreference.com/w/cpp/locale/codecvt); a conversion sketch follows the range check.
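For example, a minimal sketch of such a range check (the is_japanese helper is just a name I made up, and the blocks it tests, Hiragana, Katakana and the CJK Unified Ideographs, are only my assumption of which ranges count as "Japanese" for your purposes):

#include <string>
#include <iostream>

// Rough test: true when the code point lies in one of the common Japanese blocks.
bool is_japanese(char32_t c) {
    return (c >= 0x3040 && c <= 0x309F)    // Hiragana
        || (c >= 0x30A0 && c <= 0x30FF)    // Katakana
        || (c >= 0x4E00 && c <= 0x9FFF);   // CJK Unified Ideographs (kanji)
}

int main() {
    std::u32string s = U"Привет , 名前 hlong";
    for (char32_t c : s)
        std::cout << "0x" << std::hex << int(c) << " -> "
                  << (is_japanese(c) ? "Japanese" : "not Japanese") << std::endl;
    return 0;
}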
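And a minimal sketch of the codecvt route, converting a byte-oriented UTF-8 std::string into a character-oriented std::u32string via std::wstring_convert (note that the <codecvt> header is missing from some older standard library builds, so it may not work with the exact compiler mentioned above, and the whole facility was later deprecated in C++17):

#include <string>
#include <locale>
#include <codecvt>
#include <iostream>

int main() {
    // UTF-8 encoded bytes held in an ordinary byte-oriented std::string
    // (u8"..." is a C++11 UTF-8 string literal).
    std::string utf8 = u8"名前";

    // Convert to UTF-32 so that each element of the result is one code point.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string u32 = conv.from_bytes(utf8);

    for (char32_t c : u32)
        std::cout << "Char code = 0x" << std::hex << int(c) << std::endl;
    return 0;
}

Once the data is in std::u32string form, the same range check as above can be applied per character.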