What would be a reliable way in Java to detect whether a Chinese Unicode string contains Simplified or Traditional characters? The assumption is that characters common to both the Simplified and Traditional sets are treated as Simplified by default.
Ideally, this would be a regex match against specific Unicode character ranges. Are these ranges documented and defined, and would this approach be reliable?
Update
Summary
- For detecting the presence of Chinese characters (both Simplified and Traditional), a regex like
".*[\\u4E00-\\u9FA5]+.*"
can be used.
- To further identify hanzi specifically as Traditional or Simplified, the lists extracted from CEDICT can be used. Removing the characters common to both lists leaves two exclusive subsets, which give the required differentiation, as shown in the sample gist.
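The two steps in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from the gist: the class name `HanziScriptDetector`, the `Script` enum, and both character sets are hypothetical. The sets hold only five hand-picked pairs (国/國, 汉/漢, 语/語, 书/書, 马/馬), written as Unicode escapes so the source compiles regardless of file encoding; a real implementation would load the full exclusive subsets derived from CEDICT. Text containing only shared characters is reported as Simplified, following the assumption stated in the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper class; not taken from the original post or the gist.
final class HanziScriptDetector {

    enum Script { NONE, SIMPLIFIED, TRADITIONAL, MIXED }

    // Same range as the regex in the summary (start of the CJK Unified Ideographs block).
    private static final Pattern HAN = Pattern.compile("[\\u4E00-\\u9FA5]");

    // ASSUMPTION: tiny sample sets for illustration only.
    // A real list would be the CEDICT-derived exclusive subsets.
    private static final Set<Character> SIMPLIFIED_ONLY = new HashSet<>(
            Arrays.asList('\u56FD', '\u6C49', '\u8BED', '\u4E66', '\u9A6C'));
    private static final Set<Character> TRADITIONAL_ONLY = new HashSet<>(
            Arrays.asList('\u570B', '\u6F22', '\u8A9E', '\u66F8', '\u99AC'));

    private HanziScriptDetector() { }

    // True if the text contains at least one character in the Han range above.
    static boolean containsHan(String text) {
        return HAN.matcher(text).find();
    }

    // Classifies the text by looking for characters exclusive to either set.
    static Script classify(String text) {
        if (!containsHan(text)) {
            return Script.NONE;
        }
        boolean simplified = false;
        boolean traditional = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SIMPLIFIED_ONLY.contains(c)) {
                simplified = true;
            } else if (TRADITIONAL_ONLY.contains(c)) {
                traditional = true;
            }
        }
        if (simplified && traditional) {
            return Script.MIXED;
        }
        // Shared characters alone fall through to SIMPLIFIED, per the question's default.
        return traditional ? Script.TRADITIONAL : Script.SIMPLIFIED;
    }
}
```

With these sample sets, `HanziScriptDetector.classify("\u6F22\u8A9E")` (漢語) should report `TRADITIONAL`, while `classify("\u4E2D\u6587")` (中文, shared by both scripts) falls back to `SIMPLIFIED`. The accuracy of this approach depends entirely on how complete the two exclusive lists are.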