How to match Chinese characters in Ruby?

Question

I want to match Chinese characters in a string, but it fails:

irb(main):016:0> "身高455478".scan(/\p{Han}/)
SyntaxError: (irb):16: invalid character property name {Han}: /\p{Han}/
    from C:/Program Files/Ruby-2.1.0/bin/irb.bat:18:in `<main>'

What's wrong with it?

The problem is very strange. Is it a character encoding problem?

It's working for me though: `2.1.1 :002 > "身高455478".scan(/\p{Han}/) => ["身", "高"]` — aelor, Sep 25 '14 at 07:40
@AvinashRaj: That would (correctly) change the way the regex matches, but it doesn't explain the error. — Tim Pietzcker, Sep 25 '14 at 07:46
irb(main):016:0> "身高455478".scan(/\p{Han}/) SyntaxError: (irb):16: invalid character property name {Han}: /\p{Han}/ from C:/Program Files/Ruby-2.1.0/bin/irb.bat:18:in `<main>' — bluexuemei, Sep 25 '14 at 07:47
possible dupe http://stackoverflow.com/questions/2727804/how-to-determine-if-a-character-is-a-chinese-character — Avinash Raj, Sep 25 '14 at 08:09

Yu Hao · Answer 1 · 2014-09-25T09:02:12.660

5

I can reproduce the problem in irb. The difference between my Ruby environment and those of people who can't reproduce it is that my irb encoding defaults to GBK, an encoding for Chinese. Under GBK, Ruby does not recognize the \p{Han} property name.

This can reproduce the problem:

#encoding:GBK
p "身高455478".scan(/\p{Han}/)

This raises the error: invalid character property name {Han}: /\p{Han}/
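
To see which encodings a given session is actually using, a minimal sketch like the following may help. The method name `encoding_report` and the hash keys are made up for illustration; `__ENCODING__`, `Encoding.default_external` and `String#encoding` are standard Ruby.

```ruby
# Hypothetical helper: collect the encodings that matter for this problem.
def encoding_report(str)
  {
    source: __ENCODING__.name,                # encoding of this source file or irb input
    external: Encoding.default_external.name, # default taken from the locale or from -E
    string: str.encoding.name                 # encoding attached to this particular string
  }
end

# Written with \u escapes so the literal is UTF-8 regardless of the source encoding.
sample = "\u8EAB\u9AD8455478"
report = encoding_report(sample)
report.each { |key, name| puts "#{key}: #{name}" }
```

If `source` comes back as GBK, regex literals typed in that session are likely being compiled as GBK too, which would match the error above.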

To fix the problem, use the UTF-8 encoding:

#encoding:utf-8
p "身高455478".scan(/\p{Han}/)

Outputs: ["\u8EAB", "\u9AD8"]

As @Stefan suggests, to make irb use UTF-8, start it with `irb -E UTF-8`.

To convert just this one string to UTF-8, use String#encode:

'身高455478'.encode('utf-8').scan(/\p{Han}/u)
#=> ["\u8EAB", "\u9AD8"]
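
The same idea can be wrapped in a small helper so that callers don't need to care which encoding the input arrives in. This is only a sketch: `han_chars` is a hypothetical name, and it assumes the string's encoding tag is correct and that every character can be converted to UTF-8.

```ruby
# Hypothetical helper: return the Han characters in str, whatever its encoding.
def han_chars(str)
  # Transcode to UTF-8 first; for a string that is already UTF-8 this just copies it.
  utf8 = str.encode("UTF-8")
  # The /u flag pins the regex itself to UTF-8, independent of the source encoding.
  utf8.scan(/\p{Han}/u)
end

utf8_input = "\u8EAB\u9AD8455478"      # \u escapes, so always a UTF-8 string
gbk_input  = utf8_input.encode("GBK")  # the same text as GBK bytes

p han_chars(utf8_input)
p han_chars(gbk_input)
```

Both calls should return the same two characters. How `p` displays them (literally or as `\u` escapes) depends on the session's encodings. If the input contains bytes that are invalid for its tagged encoding, `String#encode` raises instead of returning a result.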

edited Sep 25 '14 at 09:02

answered Sep 25 '14 at 08:03

Yu Hao


I had thought that modern Ruby has UTF-8 by default. That is not the case for irb? – sawa Sep 25 '14 at 08:08
@sawa I'm in China, I guess irb reads environment from my machine and automatically sets the encoding to `GBK`. To be honest, it bothers me sometimes. – Yu Hao Sep 25 '14 at 08:13
@Yu Hao But #encoding:utf-8 doesn't work in irb. How do I do this in irb? – bluexuemei Sep 25 '14 at 08:15
Start irb with `-E`, e.g. `irb -E UTF-8` – Stefan Sep 25 '14 at 08:49
Otherwise, set your encodings explicitly: `'身高455478'.encode('UTF-8').scan(/\p{Han}/u)` – Stefan Sep 25 '14 at 08:52
Or, perhaps you can write something to `.irbrc` or something. – sawa Sep 25 '14 at 09:01
Ruby tries to detect the locale from your environment, so setting a UTF-8 compatible locale like `zh_CN.UTF-8` should work. – Stefan Sep 25 '14 at 09:19
@Stefan, I tried -E UTF-8 and zh_CN.UTF-8, but it doesn't work. Can you help me? – bluexuemei Sep 25 '14 at 10:26
@user3673267 `zh_CN.UTF-8` refers to your system's locale, it's not an option for irb's `-E` switch. Maybe you have to set your terminal's character encoding to UTF-8 as well. Just guessing, I don't have a GBK environment here. – Stefan Sep 25 '14 at 10:57
