8

I'd like to make a regex in Perl that will test a string for characters in a particular script. This would be something like:

$text =~ .*P{'Chinese'}.*

Is there a simple way of doing this? For English it's pretty easy: just test for [a-zA-Z]. But for a script like Chinese, or one of the Japanese scripts, I can't figure out any way of doing this short of writing out every character explicitly, which would make for some very ugly code. Ideas? I can't be the first or only person who has wanted to do this.

Eli
  • [This](http://stackoverflow.com/questions/4611425/how-to-count-the-chinese-word-in-a-file-using-regex-in-perl) seems helpful. – TLP Nov 30 '11 at 22:34
  • Related: http://stackoverflow.com/questions/6937087/detect-chinese-character-using-perl#6939500 – daxim Dec 01 '11 at 11:40

2 Answers

10

Look at perldoc perluniprops, which provides an exhaustive list of properties you can use with \p. You’ll be interested in \p{CJK_Unified_Ideographs} and related properties such as \p{CJK_Symbols_And_Punctuation}. \p{Hiragana} and \p{Katakana} give you the kana. There is also a \p{Script=...} property for a number of scripts: \p{Han} and \p{Script=Han} match Han characters (Chinese), but there is no corresponding \p{Script=Japanese}, quite simply because Japanese has multiple scripts.
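
As a minimal sketch of how these properties slot into a match, the snippet below tests a few strings against \p{Han}, \p{Hiragana} and \p{Katakana}. The helper name scripts_in and the sample strings are made up for illustration; only the \p{...} properties come from the answer.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

# scripts_in() is a hypothetical helper, not part of any module.
# It returns the names of the scripts found somewhere in the string.
sub scripts_in {
    my ($text) = @_;
    my @found;
    push @found, 'Han'      if $text =~ /\p{Han}/;
    push @found, 'Hiragana' if $text =~ /\p{Hiragana}/;
    push @found, 'Katakana' if $text =~ /\p{Katakana}/;
    return @found;
}

binmode STDOUT, ':encoding(UTF-8)';

# Sample strings chosen for illustration only.
for my $text ('hello', '中文', 'ひらがな', 'カタカナ') {
    my @found = scripts_in($text);
    printf "%s: %s\n", $text, @found ? join(', ', @found) : 'none';
}
```

This assumes the string holds decoded characters rather than raw UTF-8 bytes; text read from a file or socket would need decoding first (for example with an :encoding(UTF-8) layer) before the properties can match.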

Jon Purdy
  • I thought [Hiragana](http://en.wikipedia.org/wiki/Hiragana) and [Katakana](http://en.wikipedia.org/wiki/Katakana) were used only in Japanese. – ikegami Dec 01 '11 at 01:42
  • @ikegami: Right. The OP mentioned Japanese. – Jon Purdy Dec 01 '11 at 01:45
  • `\p{Kana_Supplement}` is incorrect and deprecated. It should be `\p{Block:KanaSupplement}` aka `\p{InKanaSupplement}`. The block contains only two defined characters, and they are either in Hiragana or Katakana as well. One need not worry about this block. – ikegami Dec 01 '11 at 02:04
4

There are two ways of doing that. By block (\p{Block=...}) and by script (\p{Script=...}). The latter is probably more natural.
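
To make the difference concrete, here is a small sketch comparing the two kinds of property on single code points. The helper name han_membership is hypothetical. The two code points are U+4E2D (中) and U+2E80 (CJK RADICAL REPEAT), which belongs to the Han script but sits in the CJK Radicals Supplement block rather than in CJK Unified Ideographs.

```perl
use strict;
use warnings;

# han_membership() is a hypothetical helper for illustration.
# It reports whether one code point matches the script property
# and whether it matches the block property.
sub han_membership {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return {
        script => ($char =~ /\p{Script=Han}/                   ? 1 : 0),
        block  => ($char =~ /\p{Block=CJK_Unified_Ideographs}/ ? 1 : 0),
    };
}

for my $cp (0x4E2D, 0x2E80) {
    my $m = han_membership($cp);
    printf "U+%04X script=%d block=%d\n", $cp, $m->{script}, $m->{block};
}
```

The script test should accept both characters, while the block test should accept only U+4E2D. A block is just a contiguous range of code points, so a single script is often spread over several blocks.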

I don't know much about Chinese languages, but I think you want \p{Script=Han} aka \p{Han} for Chinese.

Japanese uses three scripts:

  • Kanji: \p{Script=Han} aka \p{Han}
  • Hiragana: \p{Script=Hiragana} aka \p{Hiragana} aka \p{Hira}
  • Katakana: \p{Script=Katakana} aka \p{Katakana} aka \p{Kana}
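
The three properties above can also be combined in one bracketed character class, or counted separately. The following is only a sketch; the helper names has_japanese_script and japanese_script_counts and the sample string are assumptions made for illustration.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

# Hypothetical helper: true if the string contains at least one
# character from any of the three scripts.
sub has_japanese_script {
    my ($text) = @_;
    return $text =~ /[\p{Han}\p{Hiragana}\p{Katakana}]/ ? 1 : 0;
}

# Hypothetical helper: counts the characters belonging to each script.
sub japanese_script_counts {
    my ($text) = @_;
    my %count;
    # "= () =" forces list context so the number of matches is assigned.
    $count{Han}      = () = $text =~ /\p{Han}/g;
    $count{Hiragana} = () = $text =~ /\p{Hiragana}/g;
    $count{Katakana} = () = $text =~ /\p{Katakana}/g;
    return \%count;
}

my $sample = '日本語のテキスト';    # sample string for illustration only
my $counts = japanese_script_counts($sample);
printf "Han=%d Hiragana=%d Katakana=%d\n",
    @{$counts}{qw(Han Hiragana Katakana)};
```

Because \p{Han} also matches Chinese text, a match on the combined class does not prove the text is Japanese. The presence of Hiragana or Katakana is a stronger hint.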

You could take a look at perluniprops to find the one you are looking for, or you could use uniprops* to find which properties match a specific character.

$ uniprops 4E2D
U+4E2D ‹中› \N{CJK UNIFIED IDEOGRAPH-4E2D}
    \w \pL \p{L_} \p{Lo}
    All Any Alnum Alpha Alphabetic Assigned InCJK_UnifiedIdeographs
    CJK_Unified_Ideographs L Lo Gr_Base Grapheme_Base Graph GrBase
    Han Hani ID_Continue IDC ID_Start IDS Ideo Ideographic Letter
    L_ Other_Letter Print UIdeo Unified_Ideograph Word XID_Continue
    XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph
    X_POSIX_Print X_POSIX_Word

To find out which characters are in a given property, you can use unichars*. (This is of limited usefulness since most CJK chars aren't named.)

$ unichars -au '\p{Han}'
 ⺀ U+2E80 CJK RADICAL REPEAT
 ⺁ U+2E81 CJK RADICAL CLIFF
 ⺂ U+2E82 CJK RADICAL SECOND ONE
 ⺃ U+2E83 CJK RADICAL SECOND TWO
 ⺄ U+2E84 CJK RADICAL SECOND THREE
 ⺅ U+2E85 CJK RADICAL PERSON
 ⺆ U+2E86 CJK RADICAL BOX
 ⺇ U+2E87 CJK RADICAL TABLE
 ⺈ U+2E88 CJK RADICAL KNIFE ONE
...

* — uniprops and unichars are available from the Unicode::Tussle distro.

ikegami