20

I am working on internationalization in Struts, and I want to write JavaScript validation for Japanese and English users. I know the regular expression for English input but not for Japanese. Is it possible to write a single regular expression for both sets of users that validates on the basis of Unicode?

Please help me.

dda
Nilesh Shukla
  • 1
    please remove leading spaces on your paragraph... it interpreted as code, looks ugly – zb' Jul 22 '11 at 08:57

2 Answers

45

Here is a regular expression that matches English alphanumeric characters in both half-width and full-width forms (hankaku and zenkaku), Japanese kanji, hiragana, and katakana (including the long-vowel mark ー), and a few iteration and abbreviation marks:

/[一-龠]+|[ぁ-ゔ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[ａ-ｚＡ-Ｚ０-９]+|[々〆〤ヶ]+/u

You can edit it to fit your needs, but notice the "u" flag at the end.
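As written, the expression is not anchored, so `test()` succeeds as soon as any one allowed character appears in the input. For form validation you probably want the whole string to match. Here is a minimal sketch of that, assuming the six alternatives are merged into one character class and wrapped in `^...$`. The names `ALLOWED_TEXT` and `isAllowedText` are made up for illustration, and the full-width letters and digits are written as `\uFF..` escapes so they cannot be confused with their ASCII look-alikes. In JavaScript, the `u` flag switches the engine into Unicode mode, so patterns are interpreted as code points.

```javascript
// Sketch only: `ALLOWED_TEXT` and `isAllowedText` are hypothetical names.
// The character class merges the alternatives from the answer; the
// \uFF.. escapes are the full-width (zenkaku) letters and digits.
const ALLOWED_TEXT =
  /^[一-龠ぁ-ゔァ-ヴーa-zA-Z0-9\uFF41-\uFF5A\uFF21-\uFF3A\uFF10-\uFF19々〆〤ヶ]+$/u;

// Returns true only when every character of `text` is in the allowed set.
function isAllowedText(text) {
  return ALLOWED_TEXT.test(text);
}

console.log(isAllowedText("時々"));        // kanji plus the iteration mark
console.log(isAllowedText("Tokyo2020"));   // half-width letters and digits
console.log(isAllowedText("hello world")); // rejected: the space is not in the class
```

Spaces, punctuation, and the empty string are all rejected by this version, so add whatever extra characters your fields need to the class.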

Arzet Ro
shawndreck
  • 1
    Thanks! This helped me solve a problem as to why iPhone wasn't allowing a regex when using a unicode style escape ( \u3040-\u309F ), but you may want to change `ぁ-ん` to `ぁ-ゔ` , as ゔ comes after ん. – n_b Oct 17 '12 at 01:10
  • 2
    And just today found out a few more characters that aren't included! `々〆〤`, unicode 3005, 3006, and 3024 respectively. 3005 is probably the most important, as it is used in words like 代々木 and 時々 – n_b Oct 31 '12 at 12:31
  • 1
    @shawndreck This is working fine... but it is not allowing kanji characters ex: 漢字 then how can we make use of kanji as input... I don't know Japanese language. pls let me know ... – Sankar M Dec 02 '13 at 13:59
  • 1
    @Shankar, I'm not sure I got your question. What do you mean by "not allowing kanji characters"? How are you doing it, and what results are you expecting? – shawndreck Dec 02 '13 at 14:10
  • 2
    @shawndreck very simple.. i need to allow kanji characters. so how can i modify your suggested expression above to allow kanji characters? – Sankar M Dec 03 '13 at 13:23
  • 1
    I used this regex (based on shawndreck's answer) to whitelist only japanese characters and it worked fine: `[一-龠ぁ-ゔァ-ヴーa-zA-Z0-9々〆〤]+` – Shahar Dec 26 '13 at 13:25
  • @shankar, you are missing some classes in there, for instance the full-width alphabet [ａ-ｚ]. Not sure if you can see the difference from what I typed. Simply put, that is why there appear to be duplicates in the classes: they actually target the half-width and full-width alphabets/characters. Really hope it makes sense! – shawndreck Dec 26 '13 at 21:52
  • @shawndreck what does the /u represent ? – Anto S Jul 08 '15 at 07:20
  • @version.beta I believe u flag makes the RegEx engine treat the input string as UTF-8. – shawndreck Aug 07 '15 at 16:28
  • Thought this might help others: [regex online example](https://regex101.com/r/3y8QMe/1) – Automate This Mar 11 '20 at 03:00
  • 1
    It would be vastly faster to use a single regex character class rather than six alternated character classes: `/[一-龠ぁ-ゔァ-ヴーa-zA-Z0-9ａ-ｚＡ-Ｚ０-９々〆〤]+/u`. You'll get the same output. – Adam Katz Jul 16 '21 at 16:05
-1

Provided your text editor and programming language support Unicode, you should be able to enter Japanese characters as literal strings. Things like [A-X] ranges will probably not translate very well in general.
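One way to avoid hand-written ranges altogether, if your JavaScript engine supports ES2018 Unicode property escapes, is to name the scripts instead. This is only a sketch of that alternative, not something taken from the thread: `JAPANESE_OR_ASCII` and `isJapaneseOrAscii` are hypothetical names, and the set of allowed scripts is an assumption about what the form should accept.

```javascript
// Sketch only: `JAPANESE_OR_ASCII` and `isJapaneseOrAscii` are hypothetical names.
// \p{Script=...} needs the `u` flag and an ES2018-capable engine.
// The long-vowel mark ー is listed explicitly because its Script property
// is Common, not Katakana.
const JAPANESE_OR_ASCII =
  /^[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}ーa-zA-Z0-9]+$/u;

// Returns true when `text` consists only of kanji, kana, and ASCII alphanumerics.
function isJapaneseOrAscii(text) {
  return JAPANESE_OR_ASCII.test(text);
}

console.log(isJapaneseOrAscii("漢字かなカナ"));
console.log(isJapaneseOrAscii("ラーメン"));
console.log(isJapaneseOrAscii("東京!")); // rejected: "!" is not in the class
```

Note that `\p{Script=Han}` covers far more characters than the `一-龠` range, so this version is more permissive for kanji.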

What kind of text are you trying to validate?

What language are the regular expressions in? Perl-compatible, POSIX, or something else?

spraff