20

Using Java, how can I detect whether a String contains Chinese characters?

String chineseStr = "已下架";

if (isChineseString(chineseStr)) {
    System.out.println("The string contains Chinese characters");
} else {
    System.out.println("The string does not contain Chinese characters");
}

Can you please help me to solve the problem?

Henrik Aasted Sørensen
Ran Deloun
  • 1
    Do you want to distinguish between Chinese characters *as used in China* (mainland and/or Taiwan), or any CJK ideographic would do? For example, 辻 consists of Chinese character *elements*, but was made up in Japan and is only used there. – Seva Alekseyev Jun 11 '20 at 19:33
  • @Seva Alekseyev I just landed on this question. In my case, any Chinese / Japanese / non-Korean character would do; I mean, even those not used in China, like 峠 – SebasSBM May 26 '22 at 04:55
  • I think that's what Joop's answer does. I have similar logic, and I compare the codepoints against the CJK ranges in Unicode. The map of Unicode can be found on Wikipedia, among other places. – Seva Alekseyev May 26 '22 at 14:13

3 Answers

49

Since Java 7, Character.isIdeographic(int codepoint) tells whether the codepoint is a CJKV (Chinese, Japanese, Korean and Vietnamese) ideograph.

A closer match for Chinese characters is Character.UnicodeScript.HAN.

So:

System.out.println(containsHanScript("xxx已下架xxx"));

public static boolean containsHanScript(String s) {
    for (int i = 0; i < s.length(); ) {
        int codepoint = s.codePointAt(i);
        i += Character.charCount(codepoint);
        if (Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HAN) {
            return true;
        }
    }
    return false;
}

Or in Java 8:

public static boolean containsHanScript(String s) {
    return s.codePoints().anyMatch(
            codepoint ->
            Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HAN);
}
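
For comparison, here is a minimal sketch of the Character.isIdeographic variant mentioned at the top of this answer. The class name IdeographCheck and the method name containsIdeograph are made up for illustration. Note that this tests the Unicode Ideographic property rather than the Han script, so the two checks may not agree on every codepoint.

```java
class IdeographCheck {
    // Hypothetical helper: true if any codepoint in s has the
    // Unicode "Ideographic" property (Character.isIdeographic, Java 7+).
    static boolean containsIdeograph(String s) {
        // codePoints() (Java 8+) iterates by codepoint, so surrogate
        // pairs are handled as single characters.
        return s.codePoints().anyMatch(Character::isIdeographic);
    }

    public static void main(String[] args) {
        System.out.println(containsIdeograph("xxx已下架xxx")); // true
        System.out.println(containsIdeograph("abc"));          // false
    }
}
```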
Joop Eggen
  • 1
    isIdeographic() and UnicodeScript are only available since JDK 1.7. But in fonts like Consolas, ideographic characters are often more or less two spaces wide, so showing an error caret by just counting the chars, surrogate or not, works fine. –  Oct 23 '16 at 16:41
  • @j4nbur53 thanks for mentioning [**Character.isIdeographic(cp)**](http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isIdeographic-int-), part of Java since 1.7. – Joop Eggen Oct 23 '16 at 18:40
4

A more direct approach:

if ("粽子".matches("[\\u4E00-\\u9FA5]+")) {
    System.out.println("is Chinese");
}

If you also need to catch rarely used and exotic characters then you'll need to add all the ranges: What's the complete range for Chinese characters in Unicode?
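
Keep in mind that String.matches() only succeeds when the entire string matches the pattern. To test whether a string merely contains such a character, Matcher.find() can be used instead. The following is a rough sketch, not a definitive implementation: the class and method names are made up for illustration, and the \p{IsHan} script class (supported by java.util.regex since Java 7) is shown as one possible way to avoid listing every range by hand.

```java
import java.util.regex.Pattern;

class HanRegexCheck {
    // The range from the snippet above, without "+": find() needs only one hit.
    private static final Pattern BASIC_RANGE = Pattern.compile("[\\u4E00-\\u9FA5]");
    // Script-based character class; also covers Han characters outside that range.
    private static final Pattern HAN_SCRIPT = Pattern.compile("\\p{IsHan}");

    // Hypothetical helper: true if s contains a character in U+4E00..U+9FA5.
    static boolean containsBasicChinese(String s) {
        return BASIC_RANGE.matcher(s).find();
    }

    // Hypothetical helper: true if s contains any Han-script character.
    static boolean containsHan(String s) {
        return HAN_SCRIPT.matcher(s).find();
    }

    public static void main(String[] args) {
        // matches() must consume the whole string, so mixed text fails.
        System.out.println("xxx粽子xxx".matches("[\\u4E00-\\u9FA5]+")); // false
        System.out.println(containsBasicChinese("xxx粽子xxx"));        // true
        System.out.println(containsHan("xxx粽子xxx"));                 // true
    }
}
```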

ccpizza
  • 3
    This one doesn't simply detect Chinese characters; it tells whether the whole string is Chinese. Add .* to the beginning and the end to detect any single Chinese character. – JanBrus Jun 11 '20 at 12:33
0

You can try the Google API or the Language Detection API.

The Language Detection API includes a simple demo. You can try that first.

Ruchira Gayan Ranaweera