3

I want to match Chinese characters in a string, but it fails:

irb(main):016:0> "身高455478".scan(/\p{Han}/)
SyntaxError: (irb):16: invalid character property name {Han}: /\p{Han}/
    from C:/Program Files/Ruby-2.1.0/bin/irb.bat:18:in `<main>'

What's wrong with it?

This seems very strange. Is it a character encoding problem?

Yu Hao
bluexuemei

1 Answer

5

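I can reproduce the problem in irb. The difference between my Ruby environment and those of people who can't reproduce it is that my irb defaults to the GBK encoding, which is used for Chinese. \p{Han} is a Unicode script property, so the regex engine only recognizes it when the regex itself is in a Unicode encoding such as UTF-8.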

This can reproduce the problem:

#encoding:GBK
p "身高455478".scan(/\p{Han}/)

This raises the error: invalid character property name {Han}: /\p{Han}/

To fix the problem, use the UTF-8 encoding:

#encoding:utf-8
p "身高455478".scan(/\p{Han}/)

Outputs: ["\u8EAB", "\u9AD8"]
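
The dependence on the regex's encoding can also be checked without changing the source encoding. The following is only a sketch: `compile_han` is a hypothetical helper name, and it assumes that `Regexp.new` takes the regex encoding from the pattern string when the pattern contains a `\p{...}` property. The text is written with `\u` escapes so that it is a UTF-8 string regardless of how irb or the file is encoded.

```ruby
# Hypothetical helper: compile \p{Han} from a pattern string tagged with
# the given encoding, returning the error instead of raising it.
def compile_han(encoding_name)
  source = '\p{Han}'.dup.force_encoding(encoding_name)
  Regexp.new(source)
rescue RegexpError => e
  e
end

utf8_result = compile_han('UTF-8')
gbk_result  = compile_han('GBK')

p utf8_result.class   # a Regexp, since UTF-8 knows the Han property
p gbk_result.class    # expected to be RegexpError, matching the failure above

# "\u8EAB\u9AD8" is the same text as in the question, as a UTF-8 string.
text = "\u8EAB\u9AD8455478"
han_matches = text.scan(utf8_result)
p han_matches
```

If the GBK attempt does not fail on your Ruby build, the assumption above does not hold there, and the magic-comment reproduction is the more reliable test.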


As @Stefan suggests, to make irb use UTF-8, start it with `irb -E UTF-8`.

To convert just this one string to UTF-8, use String#encode:

'身高455478'.encode('utf-8').scan(/\p{Han}/u)
#=> ["\u8EAB", "\u9AD8"]
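
The same idea can be wrapped in a small helper for text that arrives in GBK, for example from a file or a console. This is a minimal sketch, not the only way to do it: `han_chars` is a hypothetical name, and the GBK input is simulated by transcoding a UTF-8 string, since the real source of the data is not known here.

```ruby
# Hypothetical helper: transcode the text to UTF-8, then scan it with a
# regex that the /u flag pins to UTF-8, so \p{Han} is always available.
def han_chars(text)
  text.encode('UTF-8').scan(/\p{Han}/u)
end

# Simulate a GBK-encoded string such as irb produces in a GBK locale.
gbk_text = "\u8EAB\u9AD8455478".encode('GBK')

p gbk_text.encoding       # the simulated input really is GBK
found = han_chars(gbk_text)
p found                   # the two Han characters, as UTF-8 strings
```

Because `String#encode` returns a new string, the original GBK string is left untouched and only the copy is scanned.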
Yu Hao
  • I thought modern Ruby used UTF-8 by default. Is that not the case for irb? – sawa Sep 25 '14 at 08:08
  • @sawa I'm in China; I guess irb reads the environment from my machine and automatically sets the encoding to `GBK`. To be honest, it bothers me sometimes. – Yu Hao Sep 25 '14 at 08:13
  • @Yu Hao But `#encoding:utf-8` doesn't work in irb. How do I do this in irb? – bluexuemei Sep 25 '14 at 08:15
  • Start irb with `-E`, e.g. `irb -E UTF-8` – Stefan Sep 25 '14 at 08:49
  • Otherwise, set your encodings explicitly: `'身高455478'.encode('UTF-8').scan(/\p{Han}/u)` – Stefan Sep 25 '14 at 08:52
  • Or perhaps you could put something in `.irbrc`. – sawa Sep 25 '14 at 09:01
  • Ruby tries to detect the locale from your environment, so setting a UTF-8 compatible locale like `zh_CN.UTF-8` should work. – Stefan Sep 25 '14 at 09:19
  • @Stefan, I tried `-E UTF-8` and `zh_CN.UTF-8`, but it doesn't work. Can you help me? – bluexuemei Sep 25 '14 at 10:26
  • @user3673267 `zh_CN.UTF-8` refers to your system's locale; it's not an option for irb's `-E` switch. Maybe you have to set your terminal's character encoding to UTF-8 as well. Just guessing, I don't have a GBK environment here. – Stefan Sep 25 '14 at 10:57