
What would be a reliable way in Java to detect whether a Chinese Unicode string contains Simplified or Traditional Chinese characters? The assumption is that characters common to both the simplified and traditional sets are treated as simplified by default.

Ideally, the check would be a regex match against specific Unicode character ranges. Are these ranges documented and defined, and would this approach be reliable?

Update

Summary
  • To detect the presence of Chinese characters (both simplified and traditional), a regex like ".*[\\u4E00-\\u9FA5]+.*" can be used.
  • To further identify hanzi specifically as Traditional or Simplified, the lists extracted from cedict can be used. Removing the common superset leaves two exclusive subsets, which give the required differentiation, as shown in the sample gist.
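The two summary points can be sketched roughly as follows. The class name, method names, and the two tiny character sets are hypothetical stand-ins: real sets would be loaded from the exclusive lists extracted from cedict, and the sketch uses `find()` with the same character class instead of wrapping it in `.*`.

```java
import java.util.Set;
import java.util.stream.Collectors;
import java.util.regex.Pattern;

class HanziVariantSketch {
    // Same character class as the regex in the summary (basic CJK Unified Ideographs).
    private static final Pattern HAN = Pattern.compile("[\\u4E00-\\u9FA5]");

    // Hypothetical sample data: a handful of characters standing in for the
    // full simplified-only and traditional-only lists derived from cedict.
    private static final Set<Integer> SIMPLIFIED_ONLY = toSet("们这说汉语");
    private static final Set<Integer> TRADITIONAL_ONLY = toSet("們這說漢語");

    private static Set<Integer> toSet(String chars) {
        return chars.codePoints().boxed().collect(Collectors.toSet());
    }

    // True if the string contains at least one character in the summary's range.
    static boolean containsHan(String s) {
        return HAN.matcher(s).find();
    }

    // True if any code point belongs to the simplified-only sample set.
    static boolean containsSimplifiedOnly(String s) {
        return s.codePoints().anyMatch(SIMPLIFIED_ONLY::contains);
    }

    // True if any code point belongs to the traditional-only sample set.
    static boolean containsTraditionalOnly(String s) {
        return s.codePoints().anyMatch(TRADITIONAL_ONLY::contains);
    }

    public static void main(String[] args) {
        String input = "Hello, 漢語";
        System.out.println("Han: " + containsHan(input));
        System.out.println("Traditional-only: " + containsTraditionalOnly(input));
        System.out.println("Simplified-only: " + containsSimplifiedOnly(input));
    }
}
```

Under the question's rule that shared characters count as simplified, a string would be reported as traditional only when `containsTraditionalOnly` returns true.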
ccpizza
  • [Possible duplicate](https://stackoverflow.com/q/4596576/5133585) – Sweeper Aug 09 '23 at 06:36
  • @Sweeper: I have looked at all duplicates before posting; they don't contain a straightforward answer, but rather 'look them up in the unihan table'.. this question is how to achieve it with a regex using a character class with specific known ranges – ccpizza Aug 09 '23 at 07:48
  • Well there is no straightforward answer. Finding the list of simplified characters is a necessary and sufficient step to finding the ranges that contain simplified characters, isn't it? Once you find the ranges, you basically found the whole list. Once you find the list, you also know which ranges they are in. – Sweeper Aug 09 '23 at 07:55

1 Answer

public class ChineseCharacterDetector {
    public static boolean containsSimplifiedChinese(String input) {
        // Iterate by code point so supplementary-plane characters are not split into surrogates
        return input.codePoints().anyMatch(ChineseCharacterDetector::isSimplifiedChinese);
    }

    public static boolean containsTraditionalChinese(String input) {
        return input.codePoints().anyMatch(ChineseCharacterDetector::isTraditionalChinese);
    }

    private static boolean isSimplifiedChinese(int codePoint) {
        // CJK Unified Ideographs block. It holds simplified, traditional and shared
        // characters alike, so a match only shows that the character is a Han ideograph.
        return codePoint >= 0x4E00 && codePoint <= 0x9FFF;
    }

    private static boolean isTraditionalChinese(int codePoint) {
        // Same block plus two extensions; the extensions are not exclusively traditional either.
        // Extension B lies above U+FFFF, so it needs an int code point rather than a char literal.
        return (codePoint >= 0x4E00 && codePoint <= 0x9FFF) ||  // CJK Unified Ideographs
               (codePoint >= 0x3400 && codePoint <= 0x4DBF) ||  // Extension A
               (codePoint >= 0x20000 && codePoint <= 0x2A6DF);  // Extension B
    }

    public static void main(String[] args) {
        String input = "你好,世界!Hello, 世界!";
        
        if (containsSimplifiedChinese(input)) {
            System.out.println("Contains Simplified Chinese characters");
        } else if (containsTraditionalChinese(input)) {
            System.out.println("Contains Traditional Chinese characters");
        } else {
            System.out.println("Contains neither Simplified nor Traditional Chinese characters");
        }
    }
}

The isSimplifiedChinese function checks the CJK Unified Ideographs block (U+4E00 to U+9FFF), whereas the isTraditionalChinese function checks the same block plus Extension A (U+3400 to U+4DBF) and Extension B (U+20000 to U+2A6DF). The functions containsSimplifiedChinese and containsTraditionalChinese scan the input text for characters in those ranges. Note that Unicode does not place simplified and traditional characters in separate ranges: both kinds live in the same blocks, so these checks detect the presence of Chinese characters but cannot by themselves tell the two variants apart.
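Hand-written ranges are easy to leave incomplete. As an alternative sketch, not part of the code above, the JDK's own Unicode script data can decide whether a code point is a Han ideograph. The class and method names here are made up for illustration, and the check still only detects Han characters; it does not separate simplified from traditional.

```java
class HanScriptSketch {
    // True if the JDK's Unicode data classifies the code point as Han script.
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // Scans by code point, so characters outside the BMP (e.g. Extension B) are handled.
    static boolean containsHan(String input) {
        return input.codePoints().anyMatch(HanScriptSketch::isHan);
    }

    public static void main(String[] args) {
        // U+20000 is the first code point of CJK Extension B.
        String extensionB = new String(Character.toChars(0x20000));
        System.out.println(containsHan("Hello, 世界"));
        System.out.println(containsHan("Hello"));
        System.out.println(containsHan(extensionB));
    }
}
```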

M. Usman