How can I detect Japanese text in a Java string?

Question

I need to be able to detect Japanese characters in a Java string.

Currently I'm getting the UnicodeBlock of each character and checking whether it equals Character.UnicodeBlock.KATAKANA or Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS, but I'm not 100% sure that's going to cover everything.

Any suggestions?

score 11 · Answer 1 · edited Jun 25 '14 at 15:09

I use the following Java method. It might not completely address your requirement, though.

<!-- language: lang-java -->
/**
 * Returns whether a character is a Chinese-Japanese-Korean (CJK) character.
 *
 * @param c
 *            the character to be tested
 * @return true if CJK, false otherwise
 */
private boolean isCharCJK(final char c) {
    // Look up the block once instead of once per comparison.
    final Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
    return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            // Extension B lies outside the BMP, so a single char never matches it;
            // use Character.UnicodeBlock.of(int codePoint) to reach those characters.
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT
            || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == Character.UnicodeBlock.ENCLOSED_CJK_LETTERS_AND_MONTHS;
}

Furthermore, these seem like they should work for Hiragana and Katakana characters:

private boolean isHiragana(final char c)
{
     return (Character.UnicodeBlock.of(c)==Character.UnicodeBlock.HIRAGANA);
}

private boolean isKatakana(final char c)
{
     return (Character.UnicodeBlock.of(c)==Character.UnicodeBlock.KATAKANA);
}
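
To apply per-character checks like these to a whole string, one option is to walk the string by code point rather than by char, so that supplementary-plane characters are seen whole. The following is a minimal sketch and not part of the original answer: the class name JapaneseBlockScan and the methods isKana and containsKana are hypothetical, and it assumes Java 8 or later for String.codePoints(). Half-width katakana live in HALFWIDTH_AND_FULLWIDTH_FORMS alongside full-width Latin letters, so the sketch tests their code point range (U+FF65 to U+FF9F) instead of the whole block.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper class, named here for illustration only.
final class JapaneseBlockScan {

    /** Returns true if the code point is hiragana or katakana (full- or half-width). */
    static boolean isKana(final int codePoint) {
        final UnicodeBlock block = UnicodeBlock.of(codePoint);
        if (block == UnicodeBlock.HIRAGANA
                || block == UnicodeBlock.KATAKANA
                || block == UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS) {
            return true;
        }
        // Half-width katakana share a block with full-width Latin letters,
        // so test the specific range instead of the whole block.
        return codePoint >= 0xFF65 && codePoint <= 0xFF9F;
    }

    /** Returns true if any code point in the text is kana. */
    static boolean containsKana(final String text) {
        // codePoints() yields whole code points, so surrogate pairs are not split.
        return text.codePoints().anyMatch(JapaneseBlockScan::isKana);
    }
}
```

Calling JapaneseBlockScan.containsKana(someString) then reports whether any kana is present. A string that contains only ideographs returns false here, so this check is a signal for Japanese text and not a complete test.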

This seems to fail to detect some Japanese and Korean characters. I ended up combining this with https://gist.github.com/TheFinestArtist/2fd1b4aa1d4824fcbaef — Jiechao Wang, Jun 21 '18 at 21:04

score 7 · Answer 2 · answered Sep 30 '09 at 18:23

According to regular-expressions.info, Japanese isn't written in a single script: "There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of."

In which case, this regex should do the trick (on Java 7 or later, where script names need the Is prefix):

yourString.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+")

Bart Kiers

Sorry, I wasn't precise enough ... I want to detect Japanese CHARACTERS in a string, not the character set name. – David G Sep 30 '09 at 18:31
Including Latin will match most European languages as well, which I don't think is what the OP wants to check for (although Japanese is sometimes written with Roman characters as well). – Kathy Van Stone Sep 30 '09 at 18:32
Han are Chinese characters as well, but I believe you do want to add Hiragana. – Kathy Van Stone Sep 30 '09 at 18:32
That's right, there's no way to really know. This character in a string, 本, could be part of Chinese or Japanese text. And it's neither hiragana nor katakana of any width. – PandaWood Jan 21 '11 at 01:44
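
The points raised in these comments suggest a three-way answer rather than a yes/no one: kana is a strong hint of Japanese, Han characters alone are ambiguous, and Latin says nothing either way. Below is a minimal sketch of that idea using Character.UnicodeScript (available since Java 7). The class JapaneseScriptGuess, the enum Verdict, and the method classify are hypothetical names chosen for illustration, and the heuristic itself is an assumption, not something any answer above prescribes.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper class, named here for illustration only.
final class JapaneseScriptGuess {

    enum Verdict { LIKELY_JAPANESE, HAN_ONLY, NONE }

    /** Classifies text by the scripts of its code points. */
    static Verdict classify(final String text) {
        boolean sawHan = false;
        for (int i = 0; i < text.length(); ) {
            final int codePoint = text.codePointAt(i);
            final UnicodeScript script = UnicodeScript.of(codePoint);
            if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                // Kana is specific to Japanese, so stop at the first hit.
                return Verdict.LIKELY_JAPANESE;
            }
            if (script == UnicodeScript.HAN) {
                // Han is shared with Chinese (and older Korean text): ambiguous alone.
                sawHan = true;
            }
            // Advance by one or two chars depending on the code point's width.
            i += Character.charCount(codePoint);
        }
        return sawHan ? Verdict.HAN_ONLY : Verdict.NONE;
    }
}
```

With this sketch, a string containing only 本 would be reported as HAN_ONLY, leaving the caller to decide how to treat the ambiguous case.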
