
Here is my code:

std::string st = "名前hlong";
for (size_t i = 0; i < st.length(); i++)
{
    char ch = st[i];
    if ((int)ch <= 255)
    {
        // Character is Latin.
    }
    else
    {
        // Character is Japanese.
    }
}

I want to count the number of Japanese and English characters. But it's not working. Please help me resolve this issue. Thanks all.

  • Do you want to classify the characters, i.e. get *separate* counts for Japanese and Latin characters? – unwind Mar 25 '14 at 08:41
  • What have you tried so far? What has worked? What hasn't worked? What does your code look like? You do know about [Unicode](http://en.wikipedia.org/wiki/Unicode) and its encodings? – Some programmer dude Mar 25 '14 at 08:42
  • Yes, please help me classify the characters. – Nguyễn Hải Long Mar 25 '14 at 08:42
  • std::string st = "名前hlong"; // I want to count the number of Japanese characters in this string. – Nguyễn Hải Long Mar 25 '14 at 08:43
  • Does this help:- [std::wstring VS std::string](http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring) and [Handling UTF-8 in C++](http://stackoverflow.com/questions/8513249/handling-utf-8-in-c) – Rahul Tripathi Mar 25 '14 at 08:44
  • [Here](https://www.google.com/search?q=japanese+unicode&) you will find tables of Japanese Unicode ranges. I guess that's all you need to distinguish UTF-8 representations of Japanese characters from "English" ones. – Arne Mertz Mar 25 '14 at 08:59
  • Check out https://en.wikipedia.org/wiki/List_of_Unicode_characters & http://social.msdn.microsoft.com/Forums/en-US/2922e8c9-426a-41bd-a4e2-1ca948c6c0ec/how-to-get-a-unicode-value-of-a-character?forum=vcgeneral – Ashwin Mar 25 '14 at 09:24

1 Answer


Actually, you shouldn't use `std::string` here, because `std::string` is byte-oriented and a Japanese character can't be represented as a single byte. You should use `std::wstring` (or, in C++11, `std::u16string` and `std::u32string` for UTF-16 and UTF-32, respectively).

Consider the following C++11 code:

#include <string>
#include <iostream>
#include <iomanip>

using namespace std;

int main(void) {
    wstring s = L"Привет , 名前 hlong";
    for (wchar_t c : s)
        cout << "Char code = 0x" << hex << int(c) << endl;
    return 0;
}

It compiles with GCC 4.7 as follows: `g++ -finput-charset=utf-8 -std=c++11 test_wstring.cc -o test_wstring`, and produces the following output (0x20 is the space character):

Char code = 0x41f
Char code = 0x440
Char code = 0x438
Char code = 0x432
Char code = 0x435
Char code = 0x442
Char code = 0x20
Char code = 0x2c
Char code = 0x20
Char code = 0x540d
Char code = 0x524d
Char code = 0x20
Char code = 0x68
Char code = 0x6c
Char code = 0x6f
Char code = 0x6e
Char code = 0x67

As you can see, plain ASCII characters are in the range 0-0x7F, the Cyrillic characters are 0x400+, and the Japanese ones here are 0x540d and 0x524d. Check the Unicode tables mentioned in the comments to see which ranges you're interested in. You may also consider the std::codecvt facilities and friends to convert between byte-oriented and character-oriented encodings; see http://en.cppreference.com/w/cpp/locale/codecvt
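
For example, here is a minimal sketch of that approach (the ranges used are a rough assumption, CJK ideographs only, while real Japanese text also uses Hiragana and Katakana). It converts the UTF-8 std::string to UTF-32 with std::wstring_convert (C++11, deprecated in C++17) and then classifies each code point by range:

#include <codecvt>
#include <locale>
#include <string>
#include <iostream>

int main() {
    // UTF-8 encoded literal (C++11); convert it to UTF-32 code points.
    std::string utf8 = u8"名前hlong";
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string u32 = conv.from_bytes(utf8);

    int latin = 0, japanese = 0;
    for (char32_t c : u32) {
        // CJK Unified Ideographs; for real Japanese text you would also
        // check Hiragana (0x3040-0x309F) and Katakana (0x30A0-0x30FF).
        if (c >= 0x4E00 && c <= 0x9FFF)
            ++japanese;
        else if (c <= 0x7F)          // plain ASCII / Latin
            ++latin;
    }
    std::cout << "Latin: " << latin << ", Japanese: " << japanese << std::endl;
    return 0;
}

For "名前hlong" this should print Latin: 5, Japanese: 2.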

user3159253
  • Japanese characters may be representable as multi-byte characters, and it's possible to check for those in a `std::string`, just not by comparing the values to 255. Other than that, I think your answer is more useful than mine, I'll delete mine. –  Mar 30 '14 at 11:04
  • Well, if we talk about UTF-8 string then, yes, it's possible to check if a given byte belongs to a sequence of bytes representing a single character. Likely, a single Japanese character is represented as 3 bytes in a UTF-8 byte stream. The first byte of the 3 in both 名 and 前 is equal to 0xE5. But it would be a rather boring and complex check because it requires proper decoding of all byte sequences preceding a given byte. So usually it's better to convert a UTF sequence to std::wstring and then use common character access methods of std::basic_string: `[]`, `at()` etc – user3159253 Mar 30 '14 at 11:21
  • Yes, I completely agree that you shouldn't use `std::string`. I only commented because your answer seems to imply that you *cannot* use `std::string`. But perhaps I'm reading too much into it. –  Mar 30 '14 at 11:23
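
To illustrate the byte-level check mentioned in the last comment, here is a minimal sketch (an illustration only, assuming the string is valid UTF-8): it counts characters directly in a std::string by looking at lead bytes, since every UTF-8 continuation byte has the bit pattern 10xxxxxx. Note that it only separates single-byte (ASCII) characters from multi-byte ones; telling Japanese apart from, say, Cyrillic still requires decoding the code points.

#include <string>
#include <iostream>

int main() {
    std::string utf8 = "名前hlong"; // assumes the literal is stored as UTF-8
    int ascii = 0, multibyte = 0;
    for (unsigned char b : utf8) {
        if ((b & 0x80) == 0)             // 0xxxxxxx: single-byte (ASCII) character
            ++ascii;
        else if ((b & 0xC0) == 0xC0)     // 11xxxxxx: lead byte of a multi-byte character
            ++multibyte;
        // 10xxxxxx continuation bytes are skipped
    }
    std::cout << "ASCII: " << ascii << ", multi-byte: " << multibyte << std::endl;
    return 0;
}

For "名前hlong" it should print ASCII: 5, multi-byte: 2, matching the wstring-based count above without doing a full conversion.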