There are two ways of doing that: by block (\p{Block=...}) and by script (\p{Script=...}). The latter is probably more natural.
I don't know much about Chinese languages, but I think you want \p{Script=Han} (aka \p{Han}) for Chinese.
Japanese uses three scripts:
- Kanji: \p{Script=Han} aka \p{Han}
- Hiragana: \p{Script=Hiragana} aka \p{Hiragana} aka \p{Hira}
- Katakana: \p{Script=Katakana} aka \p{Katakana} aka \p{Kana}
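As a minimal sketch of how these properties could be used in a regex, the following reports which of the three scripts occur in a string. The helper name scripts_in and the sample strings are made up for illustration; only the \p{...} properties come from the list above.

```perl
use strict;
use warnings;
use utf8;                              # this source file contains literal UTF-8
use open qw(:std :encoding(UTF-8));    # encode STDOUT/STDERR as UTF-8

# Hypothetical helper: return the names of the scripts found in $text.
sub scripts_in {
    my ($text) = @_;
    my @found;
    push @found, 'Han'      if $text =~ /\p{Han}/;
    push @found, 'Hiragana' if $text =~ /\p{Hiragana}/;
    push @found, 'Katakana' if $text =~ /\p{Katakana}/;
    return @found;
}

# Illustrative samples: Chinese, each kana script, mixed Japanese, and ASCII.
for my $sample ('中文', 'ひらがな', 'カタカナ', '日本語のテキスト', 'plain ASCII') {
    printf "%s: %s\n", $sample, join(', ', scripts_in($sample)) || 'none';
}
```

Note that a match on \p{Han} alone cannot tell Chinese from Japanese text, since kanji are Han characters too; the presence of kana is the more useful signal for Japanese.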
You could take a look at perluniprops to find the one you are looking for, or you could use uniprops* to find which properties match a specific character.
$ uniprops 4E2D
U+4E2D ‹中› \N{CJK UNIFIED IDEOGRAPH-4E2D}
\w \pL \p{L_} \p{Lo}
All Any Alnum Alpha Alphabetic Assigned InCJK_UnifiedIdeographs
CJK_Unified_Ideographs L Lo Gr_Base Grapheme_Base Graph GrBase
Han Hani ID_Continue IDC ID_Start IDS Ideo Ideographic Letter
L_ Other_Letter Print UIdeo Unified_Ideograph Word XID_Continue
XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph
X_POSIX_Print X_POSIX_Word
To find out which characters are in a given property, you can use unichars*. (This is of limited usefulness since most CJK chars aren't named.)
$ unichars -au '\p{Han}'
⺀ U+2E80 CJK RADICAL REPEAT
⺁ U+2E81 CJK RADICAL CLIFF
⺂ U+2E82 CJK RADICAL SECOND ONE
⺃ U+2E83 CJK RADICAL SECOND TWO
⺄ U+2E84 CJK RADICAL SECOND THREE
⺅ U+2E85 CJK RADICAL PERSON
⺆ U+2E86 CJK RADICAL BOX
⺇ U+2E87 CJK RADICAL TABLE
⺈ U+2E88 CJK RADICAL KNIFE ONE
...
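If installing extra tools isn't an option, a rough stand-in can be sketched in plain Perl by testing every code point against the property. This is not how unichars itself works, just an illustration; the helper name first_matching is hypothetical, and the explicit Script= form is used so that the result does not depend on how the short form is interpreted.

```perl
use strict;
use warnings;
use charnames ();                      # for charnames::viacode
use open qw(:std :encoding(UTF-8));    # encode STDOUT/STDERR as UTF-8

# Hypothetical helper: return the first $limit code points matching $re.
sub first_matching {
    my ($re, $limit) = @_;
    my @hits;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;   # skip surrogates
        push @hits, $cp if chr($cp) =~ $re;
        last if @hits >= $limit;
    }
    return @hits;
}

# Print character, code point, and name for the first few Hiragana characters.
for my $cp (first_matching(qr/\p{Script=Hiragana}/, 5)) {
    printf "%s U+%04X %s\n", chr($cp), $cp, charnames::viacode($cp) // '(unnamed)';
}
```

Scanning the whole code space for a sparse property is slow, so a limit like the one above (or a narrower code point range) keeps it practical.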
* uniprops and unichars are available from the Unicode::Tussle distro.