Because Traditional Chinese characters are not contiguous in the Unicode table, there is unfortunately no simple regex short of testing them one by one, unless properties like `\p{Hant}` and `\p{Hans}` are supported by your regex engine.
Inspired by the answer^ pointed to by @jdaz's comment, I wrote a Python script using the `hanzidentifier` module to generate a regex that matches the characters unique to Traditional Chinese&:
```python
from typing import List, Tuple

from hanzidentifier import TRADITIONAL, identify


def main():
    block = [
        *range(0x4E00, 0x9FFF + 1),    # CJK Unified Ideographs
        *range(0x3400, 0x4DBF + 1),    # CJK Unified Ideographs Extension A
        *range(0x20000, 0x2A6DF + 1),  # CJK Unified Ideographs Extension B
        *range(0x2A700, 0x2B73F + 1),  # CJK Unified Ideographs Extension C
        *range(0x2B740, 0x2B81F + 1),  # CJK Unified Ideographs Extension D
        *range(0x2B820, 0x2CEAF + 1),  # CJK Unified Ideographs Extension E
        *range(0x2CEB0, 0x2EBEF + 1),  # CJK Unified Ideographs Extension F
        *range(0x30000, 0x3134F + 1),  # CJK Unified Ideographs Extension G
        *range(0xF900, 0xFAFF + 1),    # CJK Compatibility Ideographs
        *range(0x2F800, 0x2FA1F + 1),  # CJK Compatibility Ideographs Supplement
    ]
    block.sort()

    result: List[Tuple[int, int]] = []
    for point in block:
        char = chr(point)
        identify_result = identify(char)
        if identify_result is TRADITIONAL:
            # is Traditional only, save into the result list
            if len(result) > 0 and result[-1][1] + 1 == point:
                # the current char is right after the last char, just extend the range
                result[-1] = (result[-1][0], point)
            else:
                result.append((point, point))

    range_regexes: List[str] = []
    # now we have a list of ranges, convert them into a regex
    for start, end in result:
        if start == end:
            range_regexes.append(chr(start))
        elif start + 1 == end:
            range_regexes.append(chr(start))
            range_regexes.append(chr(end))
        else:
            range_regexes.append(f'{chr(start)}-{chr(end)}')

    # join them together and wrap into [] to form a regex character set
    regex_char_set = ''.join(range_regexes)
    print(f'[{regex_char_set}]')


if __name__ == '__main__':
    main()
```
This generates the regex, which I've posted here: https://regex101.com/r/FkkHQ1/5 (it seems Stack Overflow doesn't let me post the generated regex directly).
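As a quick sanity check, the generated character class works with Python's built-in `re` module like any other set. The two-character class below is just a tiny hand-picked stand-in for the full generated one:

```python
import re

# Tiny stand-in for the full generated character class: 龍 (U+9F8D) and
# 龜 (U+9F9C) are Traditional-only forms (Simplified: 龙 and 龟).
trad_only = re.compile('[龍龜]')

print(bool(trad_only.search('恐龍')))  # Traditional text contains 龍 → True
print(bool(trad_only.search('恐龙')))  # Simplified text → False
```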
Note that because `hanzidentifier` relies on CC-CEDICT (and not even the latest version of it), some Traditional characters are certainly missing, but the coverage should be enough for commonly used characters.
Japanese Kanji is a large set. Luckily, the Japanese Agency for Cultural Affairs publishes a list of commonly used Kanji, so I created this text file for the program to read. After excluding the commonly used Kanji, I got this regex: https://regex101.com/r/FkkHQ1/7
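The exclusion step itself is simple. Here is a minimal sketch (the helper names and the assumption that the text file is plain whitespace-separated Kanji are mine, not necessarily what my script does verbatim):

```python
from typing import Iterable, List, Set


def load_kanji(path: str) -> Set[str]:
    """Read the common-Kanji text file; every non-whitespace char counts."""
    with open(path, encoding='utf-8') as f:
        return {ch for ch in f.read() if not ch.isspace()}


def exclude_kanji(points: Iterable[int], kanji: Set[str]) -> List[int]:
    """Drop code points whose character appears in the common-Kanji set."""
    return [p for p in points if chr(p) not in kanji]


# Demo with a tiny stand-in set instead of the real file:
# pretend 龍 (U+9F8D) is on the list and 龜 (U+9F9C) is not.
filtered = exclude_kanji([0x9F8D, 0x9F9C], {'龍'})
print([hex(p) for p in filtered])
```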
Unfortunately, I couldn't find a list of commonly used Korean Hanja; Hanja are rarely used nowadays anyway. Vietnamese Chữ Nho and Chữ Nôm have almost disappeared as well.
Footnotes:
^: the regex in that answer doesn't match all Simplified characters. To get a regex that matches all Simplified characters (including those shared with Traditional Chinese), change `if identify_result is TRADITIONAL` to `if identify_result is SIMPLIFIED or identify_result is BOTH`, which gives us this regex: https://regex101.com/r/FkkHQ1/6
&: this script doesn't filter out Japanese Kanji, Korean Hanja, or Vietnamese Chữ Nho/Chữ Nôm; you have to modify it to exclude them.