How Can I Run a Regex that Tests Text for Characters in a Particular Alphabet or Script?

Question

I'd like to make a regex in Perl that will test a string for characters in a particular script. This would be something like:

$text =~ .*P{'Chinese'}.*

Is there a simple way of doing this? For English it's pretty easy: just test for [a-zA-Z]. But for a script like Chinese, or one of the Japanese scripts, I can't figure out any way of doing this short of writing out every character explicitly, which would make for some very ugly code. Ideas? I can't be the first or only person who's wanted to do this.

[This](http://stackoverflow.com/questions/4611425/how-to-count-the-chinese-word-in-a-file-using-regex-in-perl) seems helpful. — TLP, Nov 30 '11 at 22:34
Related: http://stackoverflow.com/questions/6937087/detect-chinese-character-using-perl#6939500 — daxim, Dec 01 '11 at 11:40

Jon Purdy · Accepted Answer · 2011-12-01T18:38:56.887

Look at perldoc perluniprops, which provides an exhaustive list of properties you can use with \p. You’ll be interested in \p{CJK_Unified_Ideographs} and related properties such as \p{CJK_Symbols_And_Punctuation}. \p{Hiragana} and \p{Katakana} give you the kana. There is also a \p{Script=...} property for a number of scripts: \p{Han} and \p{Script=Han} match Han characters (Chinese), but there is no corresponding \p{Script=Japanese}, quite simply because Japanese has multiple scripts.
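
As a rough sketch of how these properties slot into an ordinary match (the helper name `contains_han` and the sample strings are made up for illustration, not taken from the question):

```perl
use strict;
use warnings;
use utf8;                              # this source file contains literal UTF-8 text
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: returns 1 if the string has at least one Han character.
sub contains_han {
    my ($text) = @_;
    return $text =~ /\p{Han}/ ? 1 : 0;
}

# Report, for each sample, which of the three properties match somewhere in it.
for my $sample ('hello', '中文 text', 'ひらがな', 'カタカナ') {
    printf "%s: han=%d hiragana=%d katakana=%d\n",
        $sample,
        contains_han($sample),
        ($sample =~ /\p{Hiragana}/ ? 1 : 0),
        ($sample =~ /\p{Katakana}/ ? 1 : 0);
}
```

A match succeeds as soon as one character with the property is found, so there is no need for the leading and trailing `.*` from the question.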

edited Dec 01 '11 at 18:38

answered Nov 30 '11 at 22:56


I thought [Hiragana](http://en.wikipedia.org/wiki/Hiragana) and [Katakana](http://en.wikipedia.org/wiki/Katakana) were used only in Japanese. – ikegami Dec 01 '11 at 01:42
@ikegami: Right. The OP mentioned Japanese. – Jon Purdy Dec 01 '11 at 01:45
`\p{Kana_Supplement}` is incorrect and deprecated. It should be `\p{Block:KanaSupplement}` aka `\p{InKanaSupplement}`. The block only contains two defined characters, and they are either in Hiragana or Katakana as well. One needs not worry about this block. – ikegami Dec 01 '11 at 02:04

ikegami · Answer 2 · 2011-12-01T02:19:43.130

There are two ways of doing that: by block (\p{Block=...}) and by script (\p{Script=...}). The latter is probably more natural.
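
To make the difference concrete, here is a small sketch that tests two code points against one property of each kind. The two code points are picked purely for illustration: U+4E2D lies in the CJK Unified Ideographs block, while U+2E80 (CJK RADICAL REPEAT) lies in the CJK Radicals Supplement block.

```perl
use strict;
use warnings;

# Compare a script test with a block test for each code point.
for my $cp (0x4E2D, 0x2E80) {
    my $ch = chr $cp;
    printf "U+%04X  script Han: %s  block CJK_Unified_Ideographs: %s\n",
        $cp,
        ($ch =~ /\p{Script=Han}/                  ? 'yes' : 'no'),
        ($ch =~ /\p{Block=CJK_Unified_Ideographs}/ ? 'yes' : 'no');
}
```

Both characters belong to the Han script, but only the first falls inside that particular block, which is one reason a script test tends to be the more natural fit.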

I don't know much about Chinese languages, but I think you want \p{Script=Han} aka \p{Han} for Chinese.

Japanese uses three scripts:

Kanji: \p{Script=Han} aka \p{Han}
Hiragana: \p{Script=Hiragana} aka \p{Hiragana} aka \p{Hira}
Katakana: \p{Script=Katakana} aka \p{Katakana} aka \p{Kana}
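
The three properties can be used separately or combined in one character class. The following is only a sketch: `script_counts` is a hypothetical helper and the sample sentence is invented for the example.

```perl
use strict;
use warnings;
use utf8;                              # this source file contains literal UTF-8 text
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: count how many characters of each script a string has.
sub script_counts {
    my ($text) = @_;
    my %count;
    for my $script (qw(Han Hiragana Katakana)) {
        # "() =" forces list context so the number of matches is counted.
        $count{$script} = () = $text =~ /\p{Script=$script}/g;
    }
    return \%count;
}

my $text   = '日本語のテキスト';
my $counts = script_counts($text);
printf "%s: %d\n", $_, $counts->{$_} for qw(Han Hiragana Katakana);

# A combined class checks that every character is in one of the three scripts.
print "all three scripts only\n"
    if $text =~ /\A[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]+\z/;
```

Real Japanese text usually also contains punctuation, digits and spaces, so the all-or-nothing check at the end would need a wider class in practice.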

You could take a look at perluniprops to find the one you are looking for, or you could use uniprops* to find which properties match a specific character.

$ uniprops 4E2D
U+4E2D ‹中› \N{CJK UNIFIED IDEOGRAPH-4E2D}
    \w \pL \p{L_} \p{Lo}
    All Any Alnum Alpha Alphabetic Assigned InCJK_UnifiedIdeographs
    CJK_Unified_Ideographs L Lo Gr_Base Grapheme_Base Graph GrBase
    Han Hani ID_Continue IDC ID_Start IDS Ideo Ideographic Letter
    L_ Other_Letter Print UIdeo Unified_Ideograph Word XID_Continue
    XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph
    X_POSIX_Print X_POSIX_Word
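
If Unicode::Tussle is not installed, the core Unicode::UCD module can answer a narrower version of the same question from inside a script. This sketch only looks up the script and block of one code point; it is not a replacement for the full uniprops listing.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charblock);

# U+4E2D is the same code point that was passed to uniprops above.
my $cp = 0x4E2D;
printf "U+%04X  script: %s  block: %s\n",
    $cp, charscript($cp), charblock($cp);
```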

To find out which characters are in a given property, you can use unichars*. (This is of limited usefulness since most CJK chars aren't named.)

$ unichars -au '\p{Han}'
 ⺀ U+2E80 CJK RADICAL REPEAT
 ⺁ U+2E81 CJK RADICAL CLIFF
 ⺂ U+2E82 CJK RADICAL SECOND ONE
 ⺃ U+2E83 CJK RADICAL SECOND TWO
 ⺄ U+2E84 CJK RADICAL SECOND THREE
 ⺅ U+2E85 CJK RADICAL PERSON
 ⺆ U+2E86 CJK RADICAL BOX
 ⺇ U+2E87 CJK RADICAL TABLE
 ⺈ U+2E88 CJK RADICAL KNIFE ONE
...

* — uniprops and unichars are available from the Unicode::Tussle distro.
