
Based on the example in "What's the complete range for Chinese characters in Unicode?", does the letter "s" belong to this alphabet?

var r = /[\u20000-\u2A6DF]/;
var t = 'sad';
console.log(t.match(r))

outputs ["s"]

Why?

Gigi Ionel

1 Answer


The regex you have contains astral code points:

Astral code points are pretty easy to recognize: if you need more than 4 hexadecimal digits to represent the code point, it’s an astral code point.
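
As a rough illustration of that rule, the check below treats any code point above U+FFFF as astral. The helper name `isAstral` and the sample values are hypothetical and only serve the example.

```javascript
// Hypothetical helper: a code point is astral when it lies above U+FFFF,
// which is the same as needing more than 4 hexadecimal digits.
function isAstral(codePoint) {
  return codePoint > 0xFFFF;
}

// "s", a common BMP Chinese character, and both ends of the range in the question.
var samples = [0x0073, 0x4E2D, 0x20000, 0x2A6DF];
samples.forEach(function (cp) {
  console.log('U+' + cp.toString(16).toUpperCase() + ' astral? ' + isAstral(cp));
});
```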

These code points lie outside the Basic Multilingual Plane (BMP), whose code points can be written in a JavaScript regex with a 4-digit escape of the form \uXXXX. The JavaScript regex engine in current ECMAScript 5 implementations does not support astral code points; ECMAScript 6 adds support for them (see Unicode code point escapes, written \u{...} and enabled by the u flag).
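
For comparison, here is a minimal sketch of the same range written with ECMAScript 6 code point escapes. It assumes an engine that supports the RegExp `u` flag; the variable name `astralRange` is only for illustration.

```javascript
// Assumes ES6 support for the `u` flag and \u{...} escapes.
// With `u`, each \u{...} escape is read as a single code point,
// so the class really is the range U+20000 to U+2A6DF.
var astralRange = /[\u{20000}-\u{2A6DF}]/u;

console.log(astralRange.test('sad'));       // false: no ASCII letter is in the range
console.log(astralRange.test('\u{20000}')); // true: first code point of the range
```

Without the `u` flag the braces are not treated as a code point escape, so the flag is required for this form to work.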

Thus, the problem arises when the JavaScript regex engine interprets the pattern: inside your character class it "sees" \u2000, then 0, then -, then \u2A6D, then F. The engine therefore creates a range from 0 to \u2A6D, which covers a very large number of characters, including all English letters, so this regex matches far more than intended.
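
The parse described above can be checked with a small sketch. The `spelledOut` pattern is a hypothetical rewrite of the class with each piece written as its own escape; both patterns should agree on every probe character.

```javascript
// The pattern from the question, without the `u` flag.
var original = /[\u20000-\u2A6DF]/;

// Hypothetical spelled-out equivalent:
// \u2000, then the range "0" (\u0030) to \u2A6D, then "F" (\u0046).
var spelledOut = /[\u2000\u0030-\u2A6D\u0046]/;

// "s" (U+0073) falls inside "0"-\u2A6D; "/" (U+002F) and \u2A6E fall just outside it.
var probes = ['s', '0', 'F', '\u2000', '\u2A6D', '/', '\u2A6E'];
probes.forEach(function (ch) {
  var hex = ch.charCodeAt(0).toString(16).toUpperCase();
  console.log('U+' + hex, original.test(ch), spelledOut.test(ch));
});
```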

In the post "Javascript unicode string, chinese character but no punctuation", you can find a comprehensive Chinese character regex for JavaScript built from the Unicode code point combinations used in Chinese, but it contains a couple of typos.

Here is a working snippet:

var r = /(?:[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d])+/g;
var t = '我的中文不好。我是意大利人。你知道吗?';
console.log(t.match(r));
Wiktor Stribiżew
  • Well, that's my answer :) – Gigi Ionel Aug 27 '15 at 06:40
  • Looks like it will be possible in the next generation of browsers supporting ECMAScript 6. See [ECMAScript 6 Compatibility Table](https://kangax.github.io/compat-table/es6/). – Wiktor Stribiżew Aug 27 '15 at 06:56
  • In Google Chrome, you can actually use ECMAScript 6 syntax. 1) Go to `chrome://flags/#enable-javascript-harmony` 2) click *Enable*. However, I did not manage to make it use the astral code points with RegExp :( – Wiktor Stribiżew Aug 27 '15 at 08:35