Use regular expression to match characters appearing in Traditional Chinese ONLY

Question

\p{Han} can be used to match all Chinese characters (in the Han script), which mixes Simplified Chinese and Traditional Chinese together.

Is there a regular expression that matches only the characters unique to Traditional Chinese? In other words, one that matches any Chinese character except those used in Simplified Chinese, something like (?!\p{Hans})\p{Hant}.

Ideally, the regular expression would also exclude Japanese Kanji, Korean Hanja, and Vietnamese Chữ Nho and Chữ Nôm.

Hong Kong, Taiwan, Macau (and possibly Mainland China?) all have their own standards of what the "Traditional Characters" are, and they sometimes _disagree_. See [this Wikipedia article](https://zh.wikipedia.org/wiki/常用字字形表) for more details. How do you intend to address this? — Sweeper, Aug 22 '20 at 06:42
Could you please give some examples of both characters you want to match and others you don't want to match? That way we can test our attempts. — Bohemian, Aug 22 '20 at 16:24
@bohemian Seems like Stack Overflow doesn't like posting anything other than English. As an example, take the characters that differ between the Traditional Chinese Wikipedia homepage https://zh.wikipedia.org/zh-hant/Wikipedia:%E9%A6%96%E9%A1%B5 and the Simplified one https://zh.wikipedia.org/zh-hans/Wikipedia:%E9%A6%96%E9%A1%B5. It would be ideal to also exclude commonly used Kanji, Hanja, Chữ Nho and Chữ Nôm. But considering Hanja, Chữ Nho and Chữ Nôm are not common nowadays anyway, we should probably just ignore them. I can only find a list of commonly used Kanji (link in my answer). — Yihao Gao, Aug 22 '20 at 17:19
@Bohemian You may take a look at my similar question: https://stackoverflow.com/questions/76070986/get-chinese-punctuation-in-a-string — Qiulang, May 19 '23 at 08:29

Yihao Gao · Answer 1 · 2020-08-22T18:56:03.240

Because Traditional Chinese characters are not contiguous in the Unicode table, there is unfortunately no simple Regex for them. The characters have to be tested one by one, unless the Regex engine supports something like \p{Hant} and \p{Hans}.

Inspired by the answer^ pointed to in @jdaz's comment, I wrote a Python script that uses the hanzidentifier module to generate a Regex matching the characters unique to Traditional Chinese&:

from typing import List, Tuple

from hanzidentifier import identify, TRADITIONAL


def main():
    block = [
        *range(0x4E00, 0x9FFF + 1),  # CJK Unified Ideographs
        *range(0x3400, 0x4DBF + 1),  # CJK Unified Ideographs Extension A
        *range(0x20000, 0x2A6DF + 1),  # CJK Unified Ideographs Extension B
        *range(0x2A700, 0x2B73F + 1),  # CJK Unified Ideographs Extension C
        *range(0x2B740, 0x2B81F + 1),  # CJK Unified Ideographs Extension D
        *range(0x2B820, 0x2CEAF + 1),  # CJK Unified Ideographs Extension E
        *range(0x2CEB0, 0x2EBEF + 1),  # CJK Unified Ideographs Extension F
        *range(0x30000, 0x3134F + 1),  # CJK Unified Ideographs Extension G
        *range(0xF900, 0xFAFF + 1),  # CJK Compatibility Ideographs
        *range(0x2F800, 0x2FA1F + 1),  # CJK Compatibility Ideographs Supplement
    ]
    block.sort()

    result: List[Tuple[int, int]] = []

    for point in block:
        char = chr(point)
        identify_result = identify(char)
        if identify_result is TRADITIONAL:
            # is traditional only, save into the result list
            if len(result) > 0 and result[-1][1] + 1 == point:
                # the current char is right after the last char, just update the range
                result[-1] = (result[-1][0], point)
            else:
                result.append((point, point))

    range_regexes: List[str] = []
    # now we have a list of ranges, convert them into a regex
    for start, end in result:
        if start == end:
            range_regexes.append(chr(start))
        elif start + 1 == end:
            range_regexes.append(chr(start))
            range_regexes.append(chr(end))
        else:
            range_regexes.append(f'{chr(start)}-{chr(end)}')

    # join them together and wrap into [] to form a regex set
    regex_char_set = ''.join(range_regexes)
    print(f'[{regex_char_set}]')


if __name__ == '__main__':
    main()

This generates the Regex which I've posted here: https://regex101.com/r/FkkHQ1/5 (seems like Stack Overflow doesn't like me posting the generated Regex)
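To show how a generated character class might be used, here is a minimal sketch with Python's re module. The four-character class below is a hand-picked stand-in, not the generated Regex (which is far longer), and the names TRAD_ONLY and find_traditional_only are made up for illustration.

```python
import re
from typing import List

# Assumption: a tiny hand-picked stand-in for the generated character class.
# Paste the real class printed by the script here instead.
TRAD_ONLY = re.compile('[國學語體]')


def find_traditional_only(text: str) -> List[str]:
    """Return every character in text that the class marks as Traditional-only."""
    return TRAD_ONLY.findall(text)


if __name__ == '__main__':
    # 國 and 語 are in the sample class; 中, 言 and 学 are not.
    print(find_traditional_only('中國語言学'))  # ['國', '語']
```

An empty result means the text contains none of the characters in the class, which by itself does not prove the text is Simplified.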

Note that hanzidentifier relies on CC-CEDICT, and not on the latest version of it, so some Traditional characters are certainly not covered. It should still be enough for the commonly used characters.

Japanese Kanji form a large set. Luckily, the Japanese Agency for Cultural Affairs has a list of commonly used Kanji, so I created a text file from it for the program to read. After excluding the commonly used Kanji, I got this Regex: https://regex101.com/r/FkkHQ1/7
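The exclusion step could look roughly like the sketch below. It assumes the Kanji list is a UTF-8 text file containing the characters themselves, with whitespace ignored. The helper names load_excluded and drop_excluded are hypothetical, since the actual file and the modified script are not shown here.

```python
from typing import Iterable, List, Set


def load_excluded(path: str) -> Set[int]:
    """Read a UTF-8 text file and return the code points of its non-whitespace characters."""
    with open(path, encoding='utf-8') as f:
        return {ord(ch) for ch in f.read() if not ch.isspace()}


def drop_excluded(points: Iterable[int], excluded: Set[int]) -> List[int]:
    """Keep only the code points that are not in the excluded set, in sorted order."""
    return sorted(p for p in points if p not in excluded)


if __name__ == '__main__':
    # Illustration with inline data instead of a file: 學 and 語 stand in for
    # Traditional-only candidates, and 語 is also a commonly used Kanji.
    candidates = [ord('學'), ord('語')]
    kanji = {ord('語')}
    print([chr(p) for p in drop_excluded(candidates, kanji)])  # ['學']
```

In the script itself, the same check would go inside the loop over code points, before the range-merging step, so that excluded characters never enter a range.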

Unfortunately, I couldn't find a list of commonly used Korean Hanja, and Hanja are rarely used nowadays anyway. Vietnamese Chữ Nho and Chữ Nôm have almost been wiped out as well.

Footnotes:

^: the Regex in that answer doesn't match all Simplified characters. To get a Regex that matches all Simplified characters (including those that are also used in Traditional Chinese), change "if identify_result is TRADITIONAL" to "if identify_result is SIMPLIFIED or identify_result is BOTH", which gives us the Regex: https://regex101.com/r/FkkHQ1/6

&: this script doesn't filter Japanese Kanji, Korean Hanja, Vietnamese Chữ Nho or Chữ Nôm. You have to modify it to exclude them.
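For the first footnote (^), the change comes down to which labels the loop keeps. A minimal sketch of making that choice a parameter follows. select_points is a hypothetical helper, and the classifier is passed in so that hanzidentifier's identify can be supplied by the caller. The stub classifier and its string labels in the demo are stand-ins for illustration only, not hanzidentifier's real output.

```python
from typing import Callable, Collection, Iterable, List


def select_points(
    points: Iterable[int],
    classify: Callable[[str], object],
    wanted: Collection[object],
) -> List[int]:
    """Return the code points whose classification label is in `wanted`."""
    return [p for p in points if classify(chr(p)) in wanted]


if __name__ == '__main__':
    # Stub classifier with made-up labels, standing in for hanzidentifier.identify.
    labels = {'學': 'trad', '学': 'simp', '中': 'both'}

    def stub(ch: str) -> str:
        return labels.get(ch, 'unknown')

    points = [ord(c) for c in '學学中']
    print([chr(p) for p in select_points(points, stub, {'simp', 'both'})])  # ['学', '中']
```

With the real library, the call would presumably be select_points(block, identify, {SIMPLIFIED, BOTH}), assuming SIMPLIFIED and BOTH are imported from hanzidentifier alongside TRADITIONAL.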

Good to know that, you may also check out my question https://stackoverflow.com/questions/76070986/get-chinese-punctuation-in-a-string — Qiulang, May 19 '23 at 08:30
