6

I need to be able to detect Japanese characters in a Java string.

Currently I'm getting the UnicodeBlock of each character and checking whether it equals Character.UnicodeBlock.KATAKANA or Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS, but I'm not 100% sure that's going to cover everything.

Any suggestions?

Makoto
David G

2 Answers

11

I use the following Java method, though it might not completely address your requirement.

<!-- language: lang-java -->
/**
 * Returns whether a code point is one of the Chinese-Japanese-Korean characters.
 *
 * @param codePoint
 *            the code point to be tested; an int rather than a char, because
 *            CJK Unified Ideographs Extension B lies outside the BMP and a
 *            single char can never match it
 * @return true if CJK, false otherwise
 */
private boolean isCharCJK(final int codePoint) {
    // Look the block up once instead of once per comparison.
    final Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
    return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT
            || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == Character.UnicodeBlock.ENCLOSED_CJK_LETTERS_AND_MONTHS;
}

Furthermore, these seem like they should work for Hiragana and Katakana characters:

private boolean isHiragana(final char c)
{
     return (Character.UnicodeBlock.of(c)==Character.UnicodeBlock.HIRAGANA);
}

private boolean isKatakana(final char c)
{
     return (Character.UnicodeBlock.of(c)==Character.UnicodeBlock.KATAKANA);
}
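
To check a whole string rather than a single character, the block tests above can be applied to each code point. The following is only a minimal sketch: the class name `JapaneseBlocks`, the method name `containsJapanese`, and the choice of blocks are assumptions made for illustration, not a definitive list. Note that HALFWIDTH_AND_FULLWIDTH_FORMS also contains fullwidth Latin letters and halfwidth Hangul, and that CJK ideographs are shared with Chinese, so a match here does not prove the text is Japanese.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class JapaneseBlocks {
    // Blocks treated as "Japanese" in this sketch; the selection is an assumption.
    // A HashSet is used because UnicodeBlock.of may return null for code points
    // outside any block, and HashSet.contains(null) simply returns false.
    private static final Set<UnicodeBlock> BLOCKS = new HashSet<>(Arrays.asList(
            UnicodeBlock.HIRAGANA,
            UnicodeBlock.KATAKANA,
            UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS,
            UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS));

    private JapaneseBlocks() {
    }

    /** Returns true if at least one code point of text falls in one of BLOCKS. */
    static boolean containsJapanese(final String text) {
        // codePoints() walks the string by code point, so surrogate pairs
        // are looked up as a single character.
        return text.codePoints()
                .anyMatch(cp -> BLOCKS.contains(UnicodeBlock.of(cp)));
    }
}
```

With this sketch, `JapaneseBlocks.containsJapanese("hello")` is false, while a string containing any kana or a common kanji returns true.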
mwjohnson
Rakesh N
  • This seems to fail to detect some Japanese and Korean characters. I ended up combining this with https://gist.github.com/TheFinestArtist/2fd1b4aa1d4824fcbaef – Jiechao Wang Jun 21 '18 at 21:04
7

According to regular-expressions.info, Japanese isn't made up of a single script: "There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of."

In which case, this regex should do the trick (Java accepts Unicode script names from Java 7 onwards, and they need the Is prefix):

yourString.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+")
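
Note that `matches` only succeeds when the entire string consists of the listed scripts. To detect whether a string contains any Japanese-specific character, `find()` on a kana-only pattern may be closer to the goal, since Han characters are shared with Chinese. The following is a rough sketch under those assumptions; the class name `KanaDetector` and method name `containsKana` are hypothetical, and the `\p{IsHiragana}` / `\p{IsKatakana}` script syntax requires Java 7 or later.

```java
import java.util.regex.Pattern;

final class KanaDetector {
    // Matches a single character whose Unicode script is Hiragana or Katakana.
    // Han is deliberately left out because it is shared with Chinese text.
    private static final Pattern KANA =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}]");

    private KanaDetector() {
    }

    /** Returns true if text contains at least one hiragana or katakana character. */
    static boolean containsKana(final String text) {
        // find() looks for a match anywhere in the input,
        // unlike matches(), which must consume the whole string.
        return KANA.matcher(text).find();
    }
}
```

A string written only in kanji would not be detected by this sketch, so it is a heuristic for "probably Japanese" rather than a complete test.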
Bart Kiers
  • Sorry, I wasn't precise enough ... I want to detect Japanese CHARACTERS in a string, not the character set name. – David G Sep 30 '09 at 18:31
  • Including Latin will match most European languages as well, which I don't think is what the OP wants to check for (although Japanese is sometimes written with Roman characters as well). – Kathy Van Stone Sep 30 '09 at 18:32
  • Han are Chinese characters as well, but I believe you do want to add Hiragana. – Kathy Van Stone Sep 30 '09 at 18:32
  • That's right, there's no way to really know. This character in a string, 本, could be part of Chinese or Japanese text, and it's neither hiragana nor katakana of any width. – PandaWood Jan 21 '11 at 01:44