
In JavaScript we can match individual Unicode codepoints or codepoint ranges by using the Unicode escape sequences, e.g.:

"A".match(/\u0041/) // => ["A"]
"B".match(/[\u0041-\u007A]/) // => ["B"]

But how could we write a JavaScript regular expression that matches a proper name, where the name may contain any Unicode "letter"? Is there a range of letters? A special regex sequence or character class in JavaScript?

Say my website must validate names that could be in Latin-based languages as well as Hebrew, Cyrillic, or Japanese (Katakana, Hiragana, etc.). Is this feasible in JavaScript, or is the only sane choice to delegate to a backend language with better Unicode support?

tchrist
maerics
  • You may also want to read http://stackoverflow.com/questions/4323386/multi-language-input-validation-with-utf-8-encoding/4324957#4324957 and http://stackoverflow.com/questions/4718266/advice-on-how-to-validate-names-and-surnames-using-regex/4719582#4719582 – ninjalj Apr 06 '11 at 18:34
  • And http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ and http://blog.jgc.org/2010/06/your-last-name-contains-invalid.html – ninjalj Apr 06 '11 at 18:46
  • I really think you should carefully consider your last choice: delegating the backend work to a language that actually supports The Unicode Standard. – tchrist Apr 07 '11 at 12:10

2 Answers


Here's a JS plugin that adds Unicode support to regular expressions:

http://xregexp.com/plugins/

MadBender

To look up the Unicode code point of a symbol, I use this site: http://www.fileformat.info.

Unicode Blocks (Basic Latin, .+, Cyrillic, .+, Arabic and other): http://www.fileformat.info/info/unicode/block/index.htm

Unicode Character Categories (this does not work in JS): http://www.fileformat.info/info/unicode/category/index.htm

Letters (A-я): http://www.fileformat.info/info/unicode/char/a.htm

Fonts (which chars are supported in each font): http://www.fileformat.info/info/unicode/font/index.htm

Index for all above http://www.fileformat.info/info/unicode/index.htm
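For what it's worth, engines that implement ES2018 accept Unicode property escapes such as \p{L} when the regex carries the `u` flag, so the category approach can be sketched natively there. The snippet below is only a rough illustration under that assumption: `NAME_PATTERN` and `isPlausibleName` are hypothetical names, and the set of allowed punctuation (apostrophe, space, period, hyphen) is a guess, not a rule about what real names contain.

```javascript
// Assumption: an ES2018+ engine. \p{...} only works with the `u` flag.
// NAME_PATTERN and isPlausibleName are hypothetical, not from any library.
// \p{L} = any letter in any script, \p{M} = combining marks (accents etc.).
const NAME_PATTERN = /^[\p{L}\p{M}][\p{L}\p{M}' .-]*$/u;

function isPlausibleName(input) {
  // Normalize first so composed and decomposed forms are treated alike.
  return NAME_PATTERN.test(input.normalize("NFC"));
}

console.log(isPlausibleName("José"));
console.log(isPlausibleName("Дмитрий"));
console.log(isPlausibleName("R2-D2"));
```

Older engines without property escapes would still need a library or a server-side check.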

Dmitrij Golubev
  • You mustn’t use Unicode blocks as a proxy for Unicode scripts, which is what you really want. The Unicode Standard speaks to this matter specifically. – tchrist Apr 07 '11 at 12:09
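To illustrate the block-versus-script distinction, here is a small sketch. It assumes an ES2018+ engine (Unicode property escapes with the `u` flag); `compareBlockAndScript` is a hypothetical helper written for this example. It compares a hand-written range for the Basic Latin block with the Latin script property.

```javascript
// Assumption: ES2018+ engine; \p{Script=...} needs the `u` flag.
// The Basic Latin *block* is just the code point range U+0000..U+007F.
const basicLatinBlock = /^[\u0000-\u007F]$/;
// The Latin *script* is a property assigned per character.
const latinScript = /^\p{Script=Latin}$/u;
const cyrillicScript = /^\p{Script=Cyrillic}+$/u;

// Hypothetical helper: report how a single character fares under each test.
function compareBlockAndScript(ch) {
  return { block: basicLatinBlock.test(ch), script: latinScript.test(ch) };
}

console.log(compareBlockAndScript("é")); // outside the block, yet Latin script
console.log(compareBlockAndScript("1")); // inside the block, yet not Latin script
console.log(cyrillicScript.test("Дмитрий"));
```

The digit "1" sits in the Basic Latin block but belongs to the Common script, while "é" is Latin script but lives in a different block, which is why a block range makes a poor stand-in for a script.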