
I need to treat the following kind of Chinese input as invalid in client-side validation:

Input is invalid when it consists of English letters mixed with Chinese characters and spaces, and its total length is >= 10.

For example, "你的a你的a你的a你" or "你的 你的 你的 你" (length 10) is invalid, but "你的a你的a你的a" (length 9) is OK.

I am using JavaScript for the client-side validation and Java for the server side, so ideally the same regular expression would work in both.

Can anyone give me some hints on how to write this rule as a regular expression?

Mariano
jm li

1 Answer


From "What's the complete range for Chinese characters in Unicode?", the CJK Unicode ranges are:

Block                                   Range       Comment
--------------------------------------- ----------- ----------------------------------------------------
CJK Unified Ideographs                  4E00-9FFF   Common
CJK Unified Ideographs Extension A      3400-4DBF   Rare
CJK Unified Ideographs Extension B      20000-2A6DF Rare, historic
CJK Unified Ideographs Extension C      2A700-2B73F Rare, historic
CJK Unified Ideographs Extension D      2B740-2B81F Uncommon, some in current use
CJK Unified Ideographs Extension E      2B820-2CEAF Rare, historic
CJK Compatibility Ideographs            F900-FAFF   Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement 2F800-2FA1F Unifiable variants
CJK Symbols and Punctuation             3000-303F

You probably want to allow code points from the Unicode blocks CJK Unified Ideographs and CJK Unified Ideographs Extension A.

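This regex matches a string of 0 to 9 characters, where each character is a space, a character from the CJK Symbols and Punctuation block (U+3000-U+303F, which includes the ideographic space U+3000), a letter A-Z or a-z, or a code point in those 2 CJK blocks.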

/^[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/
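
As a quick check, the pattern can be run against the examples from the question. This is only a minimal sketch: the names `CJK_OR_LATIN_MAX_9` and `isValidInput` are made up for illustration, and the regex is the one shown above.

```javascript
// Hypothetical names; the pattern itself is the one given in the answer.
const CJK_OR_LATIN_MAX_9 = /^[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/;

// Returns true when the whole string is 0 to 9 allowed characters.
function isValidInput(text) {
    return CJK_OR_LATIN_MAX_9.test(text);
}

console.log(isValidInput("你的a你的a你的a"));   // true  (9 characters)
console.log(isValidInput("你的a你的a你的a你")); // false (10 characters)
console.log(isValidInput("你的 你的 你的 你")); // false (10 characters, with spaces)
```

Because the pattern is anchored with `^` and `$`, any character outside the class (a digit, for example) also makes the input invalid, regardless of length.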

The ideographs are listed in:

However, you can add more blocks as well.
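
For instance, the supplementary-plane blocks in the table (Extension B and later) lie above U+FFFF, so in JavaScript they need the `u` flag and `\u{...}` escapes (ES2015 or later). The following is a sketch that assumes you want Extension B in addition to the blocks already used; the names `CJK_WITH_EXT_B` and `isValidInputWithExtB` are hypothetical.

```javascript
// Assumption: Extension B (U+20000-U+2A6DF) is the extra block you want to allow.
// The `u` flag makes the regex work on code points instead of UTF-16 code units.
const CJK_WITH_EXT_B =
    /^[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF\u{20000}-\u{2A6DF}]{0,9}$/u;

function isValidInputWithExtB(text) {
    return CJK_WITH_EXT_B.test(text);
}

// U+20000 is the first code point of Extension B.
console.log(isValidInputWithExtB("\u{20000}".repeat(9)));  // true
console.log(isValidInputWithExtB("\u{20000}".repeat(10))); // false
```

With the `u` flag, the `{0,9}` quantifier counts each supplementary character once, even though it takes two UTF-16 code units in a JavaScript string.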


Code:

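// Despite its name, this returns true for 0 to 9 allowed characters:
// spaces, A-Z, a-z, CJK punctuation, and the two CJK ideograph blocks.
function has10OrLessCJK(text) {
    return /^[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/.test(text);
}

// Show the validation result in the element with id "valid".
function checkValidation(value) {
    var valid = document.getElementById("valid");
    if (has10OrLessCJK(value)) {
        valid.innerText = "Valid";
    } else {
        valid.innerText = "Invalid";
    }
}

<input type="text" 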
       style="width:100%"
       oninput="checkValidation(this.value)"
       value="你的a你的a你的a">

<div id="valid">
    Valid
</div>
weakish
Mariano
  • Thanks, this really gives me some hints. I have just updated the question with a much clearer description. Can you please take a look and advise? – jm li Oct 18 '16 at 10:50
  • @jmli I edited the answer to include letters A-Z and a-z. Note that it will now consider an empty string valid. Also, it won't allow digits 0-9 or punctuation, so input such as `a!b-c(d)3` is considered invalid. – Mariano Oct 18 '16 at 11:12
  • It's helpful. Apart from the existing rules, input that is only Chinese (without any digits or English characters) should also be considered valid. Is it possible to define an "or" check using regex? – jm li Oct 18 '16 at 12:51
  • I mean pure Chinese without any length restriction – jm li Oct 18 '16 at 12:54
  • @jmli **[Alternation](http://www.regular-expressions.info/alternation.html)**: `/patternA|patternB/`... E.g. `/^[ \u3000\u3400-\u4DBF\u4E00-\u9FFF]+$|^[ A-Za-z\u3000\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/` – Mariano Oct 18 '16 at 12:59
  • @Mariano Where is the question-mark-like character located, this one: ？ Do I need to add that to the validation? – Helen Araya Jun 11 '20 at 20:04
  • @AmeteBlessed the FULLWIDTH QUESTION MARK (U+FF1F) is included in the [Halfwidth and Fullwidth Forms Block](https://www.fileformat.info/info/unicode/block/halfwidth_and_fullwidth_forms/list.htm) (FF01 - FFEE)... Yes, you can add that character to the regex as `\uFF1F`, or the whole block as `\uFF01-\uFFEE`. – Mariano Jun 13 '20 at 04:03
  • I would recommend that the table be removed from this answer, and referenced in the original question, as it has been updated, and the version represented here has errors. – Calion Oct 30 '22 at 15:01
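
The alternation suggested in the comments above can be tried out as follows. This is a sketch only: the pattern is copied from that comment, while the names `PURE_OR_SHORT_MIXED` and `isValidPureOrShortMixed` are made up for illustration.

```javascript
// First alternative: pure Chinese (plus spaces), any length of at least 1.
// Second alternative: Chinese, English letters and spaces, 0 to 9 characters.
const PURE_OR_SHORT_MIXED =
    /^[ \u3000\u3400-\u4DBF\u4E00-\u9FFF]+$|^[ A-Za-z\u3000\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/;

function isValidPureOrShortMixed(text) {
    return PURE_OR_SHORT_MIXED.test(text);
}

console.log(isValidPureOrShortMixed("你的你的你的你的你的你的")); // true  (pure Chinese, 12 characters)
console.log(isValidPureOrShortMixed("你的a你的a你的a"));         // true  (mixed, 9 characters)
console.log(isValidPureOrShortMixed("你的a你的a你的a你"));       // false (mixed, 10 characters)
```

Note that the first alternative also allows spaces, so "你的 你的 你的 你" passes with this pattern even though it is 10 characters long; remove the space from the first character class if that is not wanted.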