
Here is my code:

std::string st = "名前hlong";
for (size_t i = 0; i < st.length(); i++)
{
    char ch = st[i];
    if ((int)ch <= 255)
    {
        // Character is Latin.
    }
    else
    {
        // Character is Japanese.
    }
}

I want to count the number of Japanese and English characters. But it's not working. Please help me resolve this issue. Thanks all.

  • Do you want to classify the characters, i.e. get *separate* counts for Japanese and Latin characters? – unwind Mar 25 '14 at 08:41
  • What have you tried so far? What has worked? What hasn't worked? What does your code look like? You do know about [Unicode](http://en.wikipedia.org/wiki/Unicode) and its encodings? – Some programmer dude Mar 25 '14 at 08:42
  • Yes, please help me classify the characters. – Nguyễn Hải Long Mar 25 '14 at 08:42
  • std::string st = "名前hlong"; // I want to count the number of Japanese characters in this string. – Nguyễn Hải Long Mar 25 '14 at 08:43
  • Does this help:- [std::wstring VS std::string](http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring) and [Handling UTF-8 in C++](http://stackoverflow.com/questions/8513249/handling-utf-8-in-c) – Rahul Tripathi Mar 25 '14 at 08:44
  • [Here](https://www.google.com/search?q=japanese+unicode&) you will find tables of Japanese Unicode ranges. I guess that's all you need to distinguish UTF-8 representations of Japanese characters from "English" ones. – Arne Mertz Mar 25 '14 at 08:59
  • Check out https://en.wikipedia.org/wiki/List_of_Unicode_characters & http://social.msdn.microsoft.com/Forums/en-US/2922e8c9-426a-41bd-a4e2-1ca948c6c0ec/how-to-get-a-unicode-value-of-a-character?forum=vcgeneral – Ashwin Mar 25 '14 at 09:24

1 Answer


Actually, you shouldn't use `std::string` here, because `std::string` is byte-oriented and a Japanese character can't be represented as a single byte. You should use `std::wstring` (or, in C++11, `std::u16string` and `std::u32string` for UTF-16 and UTF-32, respectively).

Consider the following C++11 code:

#include <string>
#include <iostream>
#include <iomanip>

using namespace std;

int main(void) {
    wstring s = L"Привет , 名前 hlong";
    for (wchar_t c : s)
        cout << "Char code = 0x" << hex << int(c) << endl;
    return 0;
}

It compiles with GCC 4.7 as follows: `g++ -finput-charset=utf-8 -std=c++11 test_wstring.cc -o test_wstring`, and produces the following output (0x20 is the space character):

Char code = 0x41f
Char code = 0x440
Char code = 0x438
Char code = 0x432
Char code = 0x435
Char code = 0x442
Char code = 0x20
Char code = 0x2c
Char code = 0x20
Char code = 0x540d
Char code = 0x524d
Char code = 0x20
Char code = 0x68
Char code = 0x6c
Char code = 0x6f
Char code = 0x6e
Char code = 0x67

As you can see, plain ASCII characters are in the range 0-0x7F, the Cyrillic characters are 0x400+, and the Japanese ones here are 0x540d and 0x524d. Check the Unicode tables mentioned in the comments to see which ranges you're interested in. You may also consider the std::codecvt facilities and friends to convert between byte-oriented and character-oriented encodings; see http://en.cppreference.com/w/cpp/locale/codecvt
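
For example, here is a minimal sketch of that approach (the ranges used are a rough assumption, CJK ideographs only, while real Japanese text also uses Hiragana and Katakana). It converts the UTF-8 std::string to UTF-32 with std::wstring_convert (C++11, deprecated in C++17) and then classifies each code point by range:

#include <codecvt>
#include <locale>
#include <string>
#include <iostream>

int main() {
    // UTF-8 encoded literal (C++11); convert it to UTF-32 code points.
    std::string utf8 = u8"名前hlong";
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string u32 = conv.from_bytes(utf8);

    int latin = 0, japanese = 0;
    for (char32_t c : u32) {
        // CJK Unified Ideographs; for real Japanese text you would also
        // check Hiragana (0x3040-0x309F) and Katakana (0x30A0-0x30FF).
        if (c >= 0x4E00 && c <= 0x9FFF)
            ++japanese;
        else if (c <= 0x7F)          // plain ASCII / Latin
            ++latin;
    }
    std::cout << "Latin: " << latin << ", Japanese: " << japanese << std::endl;
    return 0;
}

For "名前hlong" this should print Latin: 5, Japanese: 2.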

user3159253
  • Japanese characters may be representable as multi-byte characters, and it's possible to check for those in a `std::string`, just not by comparing the values to 255. Other than that, I think your answer is more useful than mine, I'll delete mine. –  Mar 30 '14 at 11:04
  • Well, if we talk about UTF-8 string then, yes, it's possible to check if a given byte belongs to a sequence of bytes representing a single character. Likely, a single Japanese character is represented as 3 bytes in a UTF-8 byte stream. The first byte of the 3 in both 名 and 前 is equal to 0xE5. But it would be a rather boring and complex check because it requires proper decoding of all byte sequences preceding a given byte. So usually it's better to convert a UTF sequence to std::wstring and then use common character access methods of std::basic_string: `[]`, `at()` etc – user3159253 Mar 30 '14 at 11:21
  • Yes, I completely agree that you shouldn't use `std::string`. I only commented because your answer seems to imply that you *cannot* use `std::string`. But perhaps I'm reading too much into it. –  Mar 30 '14 at 11:23
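
To illustrate the byte-level check mentioned in the last comment, here is a minimal sketch (an illustration only, assuming the string is valid UTF-8): it counts characters directly in a std::string by looking at lead bytes, since every UTF-8 continuation byte has the bit pattern 10xxxxxx. Note that it only separates single-byte (ASCII) characters from multi-byte ones; telling Japanese apart from, say, Cyrillic still requires decoding the code points.

#include <string>
#include <iostream>

int main() {
    std::string utf8 = "名前hlong"; // assumes the literal is stored as UTF-8
    int ascii = 0, multibyte = 0;
    for (unsigned char b : utf8) {
        if ((b & 0x80) == 0)             // 0xxxxxxx: single-byte (ASCII) character
            ++ascii;
        else if ((b & 0xC0) == 0xC0)     // 11xxxxxx: lead byte of a multi-byte character
            ++multibyte;
        // 10xxxxxx continuation bytes are skipped
    }
    std::cout << "ASCII: " << ascii << ", multi-byte: " << multibyte << std::endl;
    return 0;
}

For "名前hlong" it should print ASCII: 5, multi-byte: 2, matching the wstring-based count above without doing a full conversion.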