Detect Chinese character in java

Question

Using Java, how can I detect whether a String contains Chinese characters?

String chineseStr = "已下架";

if (isChineseString(chineseStr)) {
    System.out.println("The string contains Chinese characters");
} else {
    System.out.println("The string does not contain Chinese characters");
}

Can you please help me to solve the problem?

Do you want to distinguish between Chinese characters *as used in China* (mainland and/or Taiwan), or any CJK ideographic would do? For example, 辻 consists of Chinese character *elements*, but was made up in Japan and is only used there. — Seva Alekseyev, Jun 11 '20 at 19:33
@Seva Alekseyev I just landed into this question: for my case: any chinese / japanese / non-korean character would do; I mean, even those non-used in China like 峠 — SebasSBM, May 26 '22 at 04:55
I think that's what Joop's answer does. I have a similar logic, and I compare the codepoints against the CJK ranges in the Unicode. The map of Unicode can be found in Wikipedia, among other places. — Seva Alekseyev, May 26 '22 at 14:13

Joop Eggen · Accepted Answer · 2014-10-14T10:26:40.077

49

Since Java 7, Character.isIdeographic(int codepoint) tells whether the code point is a CJKV (Chinese, Japanese, Korean and Vietnamese) ideograph.

A closer match is Character.UnicodeScript.HAN, which identifies code points that belong to the Han script.

So:

System.out.println(containsHanScript("xxx已下架xxx"));

public static boolean containsHanScript(String s) {
    for (int i = 0; i < s.length(); ) {
        int codepoint = s.codePointAt(i);
        i += Character.charCount(codepoint);
        if (Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HAN) {
            return true;
        }
    }
    return false;
}

Or in Java 8:

public static boolean containsHanScript(String s) {
    return s.codePoints().anyMatch(
            codepoint ->
            Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HAN);
}
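For comparison, the Character.isIdeographic check mentioned at the top of this answer can be sketched the same way. This is only a minimal sketch and not part of the original answer: the class name IdeographCheck and the method name containsIdeograph are made up for illustration, and it assumes Java 8 or later for String.codePoints().

```java
final class IdeographCheck {

    // Returns true if any code point in s has the Unicode Ideographic property.
    // The method name is hypothetical; Character.isIdeographic(int) is the JDK call.
    static boolean containsIdeograph(String s) {
        return s.codePoints().anyMatch(Character::isIdeographic);
    }

    public static void main(String[] args) {
        System.out.println(containsIdeograph("xxx已下架xxx"));
        System.out.println(containsIdeograph("plain ASCII"));
    }
}
```

Because this tests the Ideographic property rather than the script, kana and Hangul syllables are not matched.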

edited Oct 14 '14 at 10:26

answered Oct 14 '14 at 10:20

Joop Eggen


1

isIdeographic() and UnicodeScript are only available from JDK 1.7. But in fonts like Consolas, ideographic characters are often more or less two spaces wide, so showing an error caret by just counting the chars, surrogate or not, works fine. – Oct 23 '16 at 16:41
@j4nbur53 thanks for mentioning [**Character.isIdeographic(cp)**](http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isIdeographic-int-), part of java since 1.7. – Joop Eggen Oct 23 '16 at 18:40

ccpizza · Answer 2 · 2020-06-11T12:42:59.327

4

A more direct approach:

if ("粽子".matches("[\\u4E00-\\u9FA5]+")) {
    System.out.println("is Chinese");
}

If you also need to catch rarely used and exotic characters then you'll need to add all the ranges: What's the complete range for Chinese characters in Unicode?
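Note that matches() succeeds only when the entire string lies in the range. A minimal sketch of a contains-style check uses Matcher.find() instead. The class and method names (HanRegexCheck, containsBasicCjk, containsHanScript) are hypothetical. The second method assumes Java 7 or later, where java.util.regex accepts script classes such as \p{IsHan}, which avoids listing the ranges by hand.

```java
import java.util.regex.Pattern;

final class HanRegexCheck {

    // Same range as the snippet above: part of the CJK Unified Ideographs block.
    private static final Pattern BASIC_CJK = Pattern.compile("[\\u4E00-\\u9FA5]");

    // Any code point whose Unicode script is Han, including supplementary ones.
    private static final Pattern HAN_SCRIPT = Pattern.compile("\\p{IsHan}");

    // find() looks for a match anywhere in the input, unlike String.matches().
    static boolean containsBasicCjk(String s) {
        return BASIC_CJK.matcher(s).find();
    }

    static boolean containsHanScript(String s) {
        return HAN_SCRIPT.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsBasicCjk("I like 粽子"));
        System.out.println(containsHanScript("no ideographs here"));
    }
}
```

Compiling the patterns once as constants avoids recompiling the regex on every call, which String.matches() would do.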

edited Jun 11 '20 at 12:42

answered May 07 '20 at 20:12

ccpizza


3

This one doesn't simply detect Chinese characters; it tells whether the whole string is Chinese. Add .* to the beginning and the end to detect any single Chinese character. – JanBrus Jun 11 '20 at 12:33

score 0 · Answer 3 · answered Oct 14 '14 at 10:02

0

You can try the Google API or the Language Detection API.

The Language Detection API includes a simple demo; you can try that first.

answered Oct 14 '14 at 10:02

Ruchira Gayan Ranaweera


4

This detects languages, not characters. – Karol S Oct 15 '14 at 12:57
