I have a string variable which is a paragraph containing both English and Japanese words. I want to split Japanese from English.

My plan is to use each character's Unicode code point to decide whether it falls into U+0000 to U+007F (the Basic Latin block).

But I don't know how to write the Java code to get a char's Unicode value, or how to compare it against that range.

Can anyone give me a sample?

public void split(String str) {
    char[] cstr = str.toCharArray();
    String en = "";
    String jp = "";
    for (char c : cstr) {
        // (1) To Unicode?
        // (2) How to check whether c falls into \u0000 ~ \u007F
        if (is_en) en += c;
        else jp += c;
    }
}
Freya Ren
  • Take a look at http://stackoverflow.com/questions/2220366/get-unicode-value-of-a-character – David says Reinstate Monica Oct 21 '13 at 02:04
  • This would only tell you whether it's an English or Japanese _character_. What if you have to deal with [romaji](http://en.wikipedia.org/wiki/Romaji)? – Clockwork-Muse Oct 21 '13 at 03:49
  • What about English words like “fiancé”, “rôle”, “coöperation”, and “belovèd”? You should explain how you intend to *use* the information you get from this splitting. If you really work with *words* only (can you be sure?), then you could classify them into those written in Latin letters and those written in kana or kanji. To check whether they are actually English or Japanese words, you would need dictionaries and something more. – Jukka K. Korpela Oct 21 '13 at 05:45

1 Answer

A Java char is a 16-bit UTF-16 code unit. Assuming you don't need to handle characters outside that 16-bit range (supplementary code points above U+FFFF), you can compare it directly:

if ('\u0000' <= c && c <= '\u007f') {
    // c is in Basic Latin (treated as English)
} else {
    // c is something else
}
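
As a rough sketch, that comparison could be dropped into the split method from the question like this. The class name AsciiSplitter and the choice to return both halves as a String[] are my own assumptions for illustration, not something the question specifies. Since char is unsigned, the lower bound is always true, so only the upper bound is checked.

```java
// Hypothetical helper class; the name and return shape are illustrative assumptions.
final class AsciiSplitter {

    /**
     * Splits str into two strings: index 0 holds the chars in U+0000..U+007F,
     * index 1 holds everything else, each in original order.
     */
    static String[] split(String str) {
        StringBuilder en = new StringBuilder();
        StringBuilder other = new StringBuilder();
        for (char c : str.toCharArray()) {
            // char is unsigned, so c >= 0 always holds; only the upper bound matters.
            if (c <= '\u007f') {
                en.append(c);
            } else {
                other.append(c);
            }
        }
        return new String[] { en.toString(), other.toString() };
    }

    public static void main(String[] args) {
        // The escapes spell three kanji, so the source file stays pure ASCII.
        String[] parts = split("Hello \u65e5\u672c\u8a9e world");
        System.out.println("Basic Latin: " + parts[0]);
        System.out.println("Other:       " + parts[1]);
    }
}
```

Note that spaces and punctuation land in the first string too, because they are also in the Basic Latin range. StringBuilder is used instead of += only to avoid creating a new String on every iteration.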

I don't know, however, that this does exactly what you want. Many of the characters in that range are punctuation or control characters rather than letters, for instance. And I found a reference here to a set of Unicode characters that are a mix of Roman and "half-width kanji". Just be aware that differentiating between all the Unicode characters that might represent English letters and all the others might not be this simple; it will depend on your environment.
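
If the simple range check turns out to be too coarse, one possible alternative (a sketch, not necessarily what the asker needs) is to classify by Unicode script using Character.UnicodeScript, which has been in the JDK since Java 7. The class name ScriptSplitter and the decision to treat Hiragana, Katakana and Han as "Japanese" are assumptions of mine. Han also covers Chinese text, and shared punctuation belongs to the Common script, so it ends up in the "rest" string here.

```java
// Hypothetical helper class; the script set chosen below is an illustrative assumption.
final class ScriptSplitter {

    /** True if the code point's script is Hiragana, Katakana or Han. */
    static boolean isJapaneseScript(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HAN;
    }

    /** Returns {rest, japanese}, walking the string by code point rather than by char. */
    static String[] split(String str) {
        StringBuilder rest = new StringBuilder();
        StringBuilder jp = new StringBuilder();
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            if (isJapaneseScript(cp)) {
                jp.appendCodePoint(cp);
            } else {
                rest.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars depending on whether cp needed a surrogate pair.
            i += Character.charCount(cp);
        }
        return new String[] { rest.toString(), jp.toString() };
    }

    public static void main(String[] args) {
        // Escapes: one hiragana, one katakana, one kanji, keeping the source ASCII-only.
        String[] parts = split("abc \u3042\u30a2\u6f22 def");
        System.out.println("Rest:     " + parts[0]);
        System.out.println("Japanese: " + parts[1]);
    }
}
```

With this approach, full-width Roman letters are reported as Latin script and half-width katakana as Katakana, which sidesteps the mixed block mentioned above. It still says nothing about romaji or about whether a word is actually English.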

arcy