4

I am trying to write a regex that allows only the letters A-Z, the digits 0-9, dash (-) and underscore (_), plus Japanese characters.

$.validator.addMethod("alphaDash", function(value, element) {
  return this.optional(element) || /^[a-zA-Z0-9-_]+$/i.test(value);
}, "Username must contain only letters, numbers, dashes or underscores.");

The regex above, /^[a-zA-Z0-9-_]+$/, only works for English characters. How can I make it also accept Japanese characters (Hiragana, Katakana, Kanji)?

Penny Liu
Kiow
  • See [Check whether a string contains Japanese/Chinese characters](http://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters). – Wiktor Stribiżew Apr 27 '17 at 11:56
  • FWIW, the `XRegExp` lib is pretty darned cool: http://xregexp.com/plugins/#unicode – T.J. Crowder Apr 27 '17 at 11:56
  • Does [`^[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9fa-zA-Z0-9-_]+$`](https://regex101.com/r/pLoL5S/1) work for you? – Wiktor Stribiżew Apr 27 '17 at 12:00
  • @WiktorStribiżew Oh yes, Unicode, that should work! – Kiow Apr 27 '17 at 12:03
  • Watch out since these are script ranges, they do not match just letters/digits. Perhaps, you really need to use XRegExp and its `\pL` and `\pN` constructs to match any Unicode letter and digit. – Wiktor Stribiżew Apr 27 '17 at 12:07
  • @WiktorStribiżew I tried the lib with this: ``/[a-zA-Z0-9-_\p{Hiragana}\p{Katakana}]+$/`` but it fails if my string ends with a Hiragana or Katakana char, which I don't want – Kiow Apr 27 '17 at 12:31
  • Could you please share the string you tested against? – Wiktor Stribiżew Apr 27 '17 at 12:33
  • @WiktorStribiżew **werえ** will fail, **werえ3** will pass – Kiow Apr 27 '17 at 12:34
  • [I got *true* in both cases](https://jsfiddle.net/x455p6hq/). – Wiktor Stribiżew Apr 27 '17 at 12:36
  • @WiktorStribiżew my code ``$.validator.addMethod("alphaDash", function(value, element) { return this.optional(element) || /[a-zA-Z0-9-_\p{Hiragana}\p{Katakana}]+$/i.test(value); }, "Username must contain only letters, numbers, dashes or underscores.");`` – Kiow Apr 27 '17 at 12:37
  • Sorry, you are doing it all wrong. You cannot use Unicode properties like `\p{Han}` (this matches all Chinese chars) with JS native `RegExp`. You must reference the `XRegExp` library. – Wiktor Stribiżew Apr 27 '17 at 12:39
  • @WiktorStribiżew got it to work: ``$.validator.addMethod("alphaDash", function(value, element) { var re = XRegExp('^[a-zA-Z0-9-_\\p{Hiragana}\\p{Katakana}]+$'); return this.optional(element) || re.test(value); }, "Username must contain only letters, numbers, dashes or underscores.");`` – Kiow Apr 27 '17 at 12:52
  • Yes, but `[a-zA-Z0-9_]` = `\w`. Also, don't you need to match Kanji as well? You only included Hiragana & Katakana. – Wiktor Stribiżew Apr 27 '17 at 12:53
  • I added an answer based on that. – Wiktor Stribiżew Apr 27 '17 at 13:11
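
The behaviour Kiow reports in the comments (**werえ** fails, **werえ3** passes) can be reproduced with a short sketch. It assumes the pattern is run by the native `RegExp` engine without the `u` flag, where `\p` is treated as a plain `p` and `{Hiragana}` as literal characters inside the class. The pattern also has no leading `^`, so only the end of the string is checked. The variable name `broken` is just for illustration.

```javascript
// Kiow's pattern from the comments, evaluated by native RegExp (no XRegExp, no `u` flag).
// Here \p{Hiragana} does NOT mean "any Hiragana character": the class simply gains
// the literal characters p, {, H, i, r, a, g, n, } and so on.
var broken = /[a-zA-Z0-9-_\p{Hiragana}\p{Katakana}]+$/i;

console.log(broken.test("werえ"));   // false: the final character is not in the class
console.log(broken.test("werえ3"));  // true: only the trailing "3" has to match
console.log(broken.test("{}"));      // true: braces are now literal members of the class
```

This is why the fix in the later comment needs both `XRegExp(...)` and the `^` anchor.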

2 Answers

3

According to the XRegExp Unicode scripts addon:

  • Hiragana (\p{Hiragana}) char regex: [\u3041-\u3096\u309D-\u309F]|\uD82C\uDC01|\uD83C\uDE00
  • Katakana (\p{Katakana}) char regex: [\u30A1-\u30FA\u30FD-\u30FF\u31F0-\u31FF\u32D0-\u32FE\u3300-\u3357\uFF66-\uFF6F\uFF71-\uFF9D]|\uD82C\uDC00
  • Kanji (\p{Han}): [\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FD5\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1]|\uD87E[\uDC00-\uDE1D]
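
The `\uD82C\uDC01`-style alternatives in these patterns sit outside the square brackets because they are surrogate pairs: one character above U+FFFF stored as two UTF-16 code units. A small sketch of why that matters, assuming a plain `RegExp` without the `u` flag (the variable name `astral` is only illustrative):

```javascript
// U+1B001 is the code point encoded by the pair \uD82C\uDC01 in the Hiragana pattern.
var astral = String.fromCodePoint(0x1B001);

console.log(astral === "\uD82C\uDC01"); // true
console.log(astral.length);             // 2 (two UTF-16 code units)

// Inside a character class (no `u` flag) each code unit is a separate member,
// so the class matches only one half of the pair at a time.
console.log(/^[\uD82C\uDC01]$/.test(astral));   // false
// Written as a sequence, the pair matches the whole character.
console.log(/^(?:\uD82C\uDC01)$/.test(astral)); // true
```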

You may either use XRegExp (which is preferable since the library is constantly updated):

var rx = new XRegExp("^[-\\w\\p{Hiragana}\\p{Katakana}\\p{Han}]+$");
console.log(XRegExp.test("werえ", rx));
console.log(XRegExp.test("werえ3", rx));
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.2.0/xregexp-all.min.js"></script>

Or you may use those ranges to build a native regex, which you will then have to maintain yourself as Unicode evolves:

var pHiragana = "[\\u3041-\\u3096\\u309D-\\u309F]|\\uD82C\\uDC01|\\uD83C\\uDE00";
var pKatakana = "[\\u30A1-\\u30FA\\u30FD-\\u30FF\\u31F0-\\u31FF\\u32D0-\\u32FE\\u3300-\\u3357\\uFF66-\\uFF6F\\uFF71-\\uFF9D]|\\uD82C\\uDC00";
var pHan = "[\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u3005\\u3007\\u3021-\\u3029\\u3038-\\u303B\\u3400-\\u4DB5\\u4E00-\\u9FD5\\uF900-\\uFA6D\\uFA70-\\uFAD9]|[\\uD840-\\uD868\\uD86A-\\uD86C\\uD86F-\\uD872][\\uDC00-\\uDFFF]|\\uD869[\\uDC00-\\uDED6\\uDF00-\\uDFFF]|\\uD86D[\\uDC00-\\uDF34\\uDF40-\\uDFFF]|\\uD86E[\\uDC00-\\uDC1D\\uDC20-\\uDFFF]|\\uD873[\\uDC00-\\uDEA1]|\\uD87E[\\uDC00-\\uDE1D]";
var rx = new RegExp("^([\\w-]|" + pHiragana + "|" + pKatakana + "|" + pHan + ")+$");
console.log(rx.test("werえ"));
console.log(rx.test("werえ3"));
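
For what it's worth, engines that implement ES2018 Unicode property escapes can express the same idea natively, provided the `u` flag is set. The following is only a sketch under that assumption, not part of the approach above; `rxNative` is an illustrative name, and the hyphen is placed last in the class so it is read literally.

```javascript
// Assumes an ES2018+ engine; \p{...} is only recognised when the `u` flag is present.
var rxNative = /^[\w\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}-]+$/u;

console.log(rxNative.test("werえ"));   // true
console.log(rxNative.test("werえ3"));  // true
console.log(rxNative.test("wer え"));  // false: the space is not allowed
```

One caveat: the prolonged sound mark ー (U+30FC) belongs to the Common script rather than Katakana, so a name like コーヒー is rejected by this pattern, just as it is by the ranges listed above. If that matters, `\p{Script_Extensions=Katakana}` or adding `\u30FC` explicitly is worth considering.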
Wiktor Stribiżew
0

Here's an example regex that matches Hiragana (Unicode range U+3040 to U+309F) alongside ASCII letters, digits and underscore: /[a-zA-Z0-9_\u3040-\u309F]+/ http://regexr.com/3frf9

You can extend this with ranges for other scripts. You may want to check out this answer to see some of the other Unicode ranges, or look them up elsewhere online.
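
Note that, used as a validator, the pattern as written has no anchors, so it returns true as soon as any allowed character appears anywhere in the string. A sketch of the difference follows, assuming the goal is the whole-string check from the question; the names `loose` and `strict` are illustrative, and the dash is added to `strict` because the question asks for it.

```javascript
// The pattern as posted: matches if ANY run of allowed characters exists.
var loose = /[a-zA-Z0-9_\u3040-\u309F]+/;
// Anchored variant with a literal dash: every character must be allowed.
var strict = /^[a-zA-Z0-9_\u3040-\u309F-]+$/;

console.log(loose.test("!!a"));     // true: "a" alone satisfies it
console.log(strict.test("!!a"));    // false
console.log(strict.test("wer-え")); // true
```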

Community
jas7457