23

I recently wrote a regex for my PHP code that allows only letters (including special characters) and spaces, but I'm having trouble converting it into a JavaScript-compatible regex. Here it is: /^[\s\p{L}]+$/u. The problem is the /u modifier at the end of the pattern, since JavaScript doesn't allow that flag.

How can I rewrite this so that it works in JavaScript as well?

Is there something that allows only Polish characters: Ł, Ą, Ś, Ć, ...?

Toto
Scott
  • 3
    Perhaps [this answer](http://stackoverflow.com/a/6381892/558021) will be helpful here. – Lix Oct 15 '12 at 13:49
  • 1
    Are you sure you need the u flag? Have you tried removing it and testing the expression? – cammil Oct 15 '12 at 13:52
  • 1
    @cammil "u" is required so the "\p{L}" is recognized as checking for UTF-8 letters. – Matt S Oct 15 '12 at 13:55

3 Answers

20

The /u modifier enables Unicode support. Support for it was added to JavaScript in ES2015.

Read http://stackoverflow.com/questions/280712/javascript-unicode to learn more about Unicode in JavaScript regexes.
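As a rough sketch of what this means in practice: in a modern engine the PHP pattern from the question can be used as-is. One caveat worth stating is that the `u` flag itself arrived in ES2015, while `\p{...}` property escapes such as `\p{L}` were only added in ES2018, so this assumes an ES2018-capable browser or Node.js. The helper name `isLettersAndSpaces` is made up for illustration.

```javascript
// Sketch: the question's PHP pattern carried over unchanged.
// Assumes an engine with \p{...} property escapes (ES2018 or later).
const lettersAndSpaces = /^[\s\p{L}]+$/u;

// Hypothetical helper name, not part of the original post.
function isLettersAndSpaces(text) {
  return lettersAndSpaces.test(text);
}

console.log(isLettersAndSpaces("Łódź Żółć")); // true
console.log(isLettersAndSpaces("abc123"));    // false (digits are not letters)
```

Without the `u` flag, `\p{L}` is not treated as a property escape, so the flag has to stay.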


Polish characters:

Ą \u0104
Ć \u0106
Ę \u0118
Ł \u0141
Ń \u0143
Ó \u00D3
Ś \u015A
Ź \u0179
Ż \u017B
ą \u0105
ć \u0107
ę \u0119
ł \u0142
ń \u0144
ó \u00F3
ś \u015B
ź \u017A
ż \u017C

All special Polish characters:

[\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]
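For illustration, here is one way the class above might be used to pick the special Polish letters out of a string. This is only a sketch: the sample sentence and the variable names are assumptions, and the `normalize("NFC")` call is an extra precaution (not part of the answer) so that letters typed as a base letter plus a combining mark are compared as single code points.

```javascript
// The character class from the answer, with the global flag to collect every match.
const polishSpecial = /[\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]/g;

// Hypothetical sample text; NFC normalization is an added precaution.
const sample = "Zażółć gęślą jaźń".normalize("NFC");

// match() returns null when nothing matches, so fall back to an empty array.
const found = sample.match(polishSpecial) || [];

console.log(found.length);   // 9
console.log(found.join("")); // żółćęśąźń
```

Because the class is written with `\u` escapes, the result does not depend on the encoding in which the script file itself is saved.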
Cameron Tacklind
Ωmega
  • 1
    One might argue that the modifier isn't needed in any language/environment that properly handles Unicode instead of a mishmash of binary data and actual Unicode text in strings such as PHP. – Joey Oct 15 '12 at 14:02
  • @Joey - The PHP `preg` functions, which are based on PCRE, support Unicode when the `/u` option is appended to the regular expression. – Ωmega Oct 15 '12 at 14:04
  • @Scott - The Polish language uses the Latin script, so go with ranges: `[\u0000-\u007F]` = Basic Latin; `[\u0080-\u00FF]` = Latin-1 Supplement; `[\u0100-\u017F]` = Latin Extended-A; `[\u0180-\u024F]` = Latin Extended-B; ... which together give `[\u0000-\u024F]` to include all Latin characters :) – Ωmega Oct 15 '12 at 14:07
  • 1
    Ωmega, I know why the flag is needed in PCRE and fundamentally it's the problem that PHP doesn't have a defined character set for strings, leading to some strings being in some legacy character set, some in UTF-8, some storing even non-text binary data. Environments such as Java or .NET have it far easier in that regard, given that text is always Unicode. – Joey Oct 15 '12 at 14:15
  • 2
    This answer is one of the first results on Google when searching for "regex u flag", so you might want to update it with a preface stating that it has been defined in ES2015 and is now supported by most recent browsers :) – Aaron Aug 25 '16 at 20:44
  • @Ωmega If you only want to catch letters, you could use: `[\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8]` but it certainly doesn't look as neat. – Liggliluff Nov 07 '19 at 15:14
6

JavaScript doesn't have any notion of UTF-8 strings, so it's unlikely that you need the /u flag. (Your strings are probably already in the usual JavaScript form, one UTF-16 code-unit per "character".)

The bigger problem is that JavaScript before ES2018 doesn't support \p{L}, nor any equivalent notation; regexes in those engines have no awareness of Unicode character properties. See the answers to this StackOverflow question for some ways to approximate it.


Edited to add: If you only need to support Polish letters, then you can write /^[\sa-zA-ZĄĆĘŁŃÓŚŹŻąćęłńóśźż]+$/. The a-z and A-Z parts cover the ASCII letters, and then the remaining letters are listed out individually.
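A minimal sketch of that pattern in use, assuming the script file is saved as UTF-8 so the literal letters survive intact. The function name `isPolishText` and the sample inputs are hypothetical.

```javascript
// The pattern from the answer: whitespace, ASCII letters, and the
// Polish letters listed individually. No /u flag is needed here.
const polishLettersAndSpaces = /^[\sa-zA-ZĄĆĘŁŃÓŚŹŻąćęłńóśźż]+$/;

// Hypothetical helper name, used only for this illustration.
function isPolishText(text) {
  return polishLettersAndSpaces.test(text);
}

console.log(isPolishText("Świętosław Żółkiewski")); // true
console.log(isPolishText("José"));                  // false (é is not in the class)
console.log(isPolishText("Łódź 2012"));             // false (digits are not allowed)
```

Unlike `\p{L}`, this rejects letters from other alphabets, which is the point when only Polish input should pass.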

Community
ruakh
  • Bad news... so maybe there is something to allow only those Polish characters: `Ł`, `Ą`, `Ś`, `Ć`, `Ę` instead? – Scott Oct 15 '12 at 13:57
  • Scott, if you have a small set of characters you want to allow you can always use a character class. – Joey Oct 15 '12 at 14:03
  • @Joey Yes, generally I would like to additionally allow only those special characters I mentioned above. – Scott Oct 15 '12 at 14:09
  • In a JavaScript regexp you can refer to Unicode characters like this: `\u0161`. For example, this will allow only printable ASCII and ć: `var newtxt = txt.replace(/[^\u0107\u0020-\u007e]/g, '')`. You can find the Unicode code points for your characters here, for example: http://www.fileformat.info/info/unicode/char/107/index.htm – DamirR Oct 15 '12 at 14:36
  • @DamirR: What a bizarre comment. `/\u0107/` is equivalent to `/Ć/`; why on Earth would you prefer the former? – ruakh Oct 15 '12 at 15:30
  • 1
    @ruakh: Life is full of bizarre moments. :) For `/Ć/` to work you MUST save the JS file in UTF-8. Sometimes other people might use, change, and save your code, and they might use another encoding (e.g. ISO-8859-1). Then `/Ć/` will not be saved correctly and the script will not work. If you use `/\u0107/`, that kind of bug is avoided. – DamirR Oct 28 '12 at 13:35
1

As of ES2015, /u is supported in JavaScript. See:

Futago-za Ryuu
  • It's currently not supported by all browsers. – Poul Bak Dec 03 '18 at 04:11
  • @PoulBak It says on [the Mozilla docs](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode#Browser_compatibility) it's supported by all major browsers, unless they got it wrong. – Futago-za Ryuu Dec 08 '18 at 18:47
  • Some versions of Edge will simply crash if you use it, but I think that has been fixed, so you're probably right (no one uses IE anymore). – Poul Bak Dec 08 '18 at 20:25