Match only unicode letters

Question

I have the following regex that allows only alphabetic characters:

     /[a-zA-Z]+/

     a = "abcDF"
     if (a.match(/[a-zA-Z]+/) == a){
        //Match
     }else{
        //No Match
     }

How can I do this using \p{L} so that it works universally, for any language (German, English, etc.)?

What I tried :

  a.match(/[p{l}]+/)
  a.match(/[\p{l}]+/)
  a.match(/p{l}/)
  a.match(/\p{l}/)

but all of them returned null for the string a = "aB".

Tim Pietzcker · Accepted Answer · 2018-03-09T06:17:38.247

13

Starting with ECMAScript 2018, JavaScript finally supports Unicode property escapes natively.
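
A minimal sketch of the native form, assuming an engine with ES2018 property escapes. The helper name `isOnlyLetters` is made up for illustration. Anchoring the pattern with `^` and `$` takes the place of the `match(...) == a` comparison from the question:

```javascript
// Hypothetical helper: true only if the whole string consists of Unicode letters.
// Assumes ES2018 support; the u flag is mandatory for \p{...}.
function isOnlyLetters(text) {
  return /^\p{L}+$/u.test(text);
}

console.log(isOnlyLetters("Straße")); // true
console.log(isOnlyLetters("aB"));     // true
console.log(isOnlyLetters("abc123")); // false
console.log(isOnlyLetters(""));       // false, because + needs at least one letter
```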

For older versions, you either need to define all the relevant Unicode ranges yourself, or you can use Steven Levithan's XRegExp package with its Unicode add-ons and utilize its Unicode property shortcuts:

var regex = new XRegExp("^\\p{L}*$")
var a = "abcäöüéèê"
if (regex.test(a)) {
    // Match
} else {
    // No Match
}
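
If neither native property escapes nor XRegExp is available, the "define the ranges yourself" route might look like the sketch below. The ranges are an assumption chosen for illustration: they cover ASCII letters plus the letters of the Latin-1 Supplement, Latin Extended-A and Latin Extended-B blocks, which is enough for German or French but far from everything `\p{L}` matches. The name `latinLetters` is hypothetical.

```javascript
// Assumption: only Latin-script letters are needed. This is NOT equivalent to \p{L}.
// A-Z, a-z        ASCII letters
// \u00C0-\u00D6   À..Ö (stops before \u00D7, the multiplication sign)
// \u00D8-\u00F6   Ø..ö (stops before \u00F7, the division sign)
// \u00F8-\u024F   ø..ɏ (rest of Latin-1, then Latin Extended-A and -B)
var latinLetters = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+$/;

console.log(latinLetters.test("Straße")); // true
console.log(latinLetters.test("3×4"));    // false
console.log(latinLetters.test("мир"));    // false: Cyrillic is outside these ranges
```

The last line shows the cost of this approach: every additional script needs its own ranges.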

edited Mar 09 '18 at 06:17

answered Nov 03 '12 at 14:57

Tim Pietzcker


I don't have a problem using a package, but tell me: is it mandatory to use a package for checking different languages like German, English, etc.? – user1767962 Nov 03 '12 at 14:59
Someone told me that \w matches letters of any language. Is that true? – user1767962 Nov 03 '12 at 15:00
`\w` only matches ASCII letters/digits/underscore in JavaScript. There is no easy way around XRegExp if you want to support Unicode. – Tim Pietzcker Nov 03 '12 at 15:01
@user1767962: And that's going to be hard, because you'll find German words that use accented letters, English words that use "umlauts" (trema) etc., so there isn't a clear boundary between languages and their "allowed" character sets. – Tim Pietzcker Nov 03 '12 at 15:03
I get an "Invalid escape character" error for ^\\p{L}*$ because of the two backslashes. Is it a typo? – user1767962 Nov 03 '12 at 15:09
@user1767962: No, in a string you need two backslashes for each literal backslash. But you do need to use the Unicode add-on, I forgot that: http://xregexp.com/plugins/ – Tim Pietzcker Nov 03 '12 at 15:11
Doesn't work with `\p{IsCyrillic}` regex. Java does. – izogfif Jan 29 '19 at 07:12
@izogfif: Yes, as specified in the first document I linked to. This was a conscious decision in order to avoid ambiguity. Some languages simply ignore any `is` prefix, making `isGreek` an alternative to `Greek`, but that may lead to problems (`Isolation`?), so it was decided not to support this. – Tim Pietzcker Jan 29 '19 at 08:27
@TimPietzcker is there going to be some conversion library / method / parameter so I can simply copy-paste regexps from the existing, working and tested Java code and use them in JavaScript? It's tedious to keep two copies of the same regexps in two places. – izogfif Jan 29 '19 at 11:36
Not that I know of. RegexBuddy can convert regexes between flavors, and although it doesn't yet support the new JS regex features, it can for example convert from Java to PCRE and give the desired result. – Tim Pietzcker Jan 29 '19 at 11:40

score 6 · Answer 2 · answered May 08 '15 at 18:30

If you are willing to use Babel to build your JavaScript, there's a Babel plugin I have released that transforms regular expressions like /^\p{L}+$/ or /\p{^White_Space}/ into regular expressions that browsers will understand.

This is the project page: https://github.com/danielberndt/babel-plugin-utf-8-regex

score 4 · Answer 3 · answered Aug 18 '20 at 13:31

You may use \p{L} in modern ECMAScript 2018+ compliant JavaScript environments, but remember that Unicode property classes are only supported when you pass the u modifier/flag:

a.match(/\p{L}+/gu)
a.match(/\p{Alphabetic}+/gu)

These will match all occurrences of one or more Unicode letters in the string a.
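
As a sketch of what such a call returns, and of why the attempts in the question came back null (the sample strings are made up for illustration, and an ES2018 engine is assumed):

```javascript
// With both g and u, match() returns every run of letters.
const runs = "abc 123 Straße".match(/\p{L}+/gu);
console.log(runs); // ["abc", "Straße"]

// Without the u flag, \p is just an escaped "p", so the pattern
// looks for the literal text "p{L}" and never sees a letter class.
console.log("aB".match(/\p{L}+/)); // null
console.log(/\p{L}/.test("p{L}")); // true

// Property names are case-sensitive: \p{l} is rejected under the u flag.
let lowercaseError = null;
try {
  new RegExp("\\p{l}", "u");
} catch (e) {
  lowercaseError = e;
}
console.log(lowercaseError instanceof SyntaxError); // true
```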

NOTE that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ, the character for the Roman numeral 12), plus some other symbols matched with \p{Other_Alphabetic} (\p{OAlpha}).
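
A small sketch of that difference, using the Roman numeral from the note as the test character (the variable name is illustrative, and an ES2018 engine is assumed):

```javascript
const romanTwelve = "\u216B"; // Ⅻ ROMAN NUMERAL TWELVE, general category Nl

console.log(/^\p{L}$/u.test(romanTwelve));          // false: not a letter
console.log(/^\p{Nl}$/u.test(romanTwelve));         // true
console.log(/^\p{Alphabetic}$/u.test(romanTwelve)); // true
console.log(/^\p{Alpha}$/u.test(romanTwelve));      // true: Alpha is the short alias
console.log(/^\p{L}$/u.test("a"));                  // true
```

So \p{L} is the stricter choice when only letters should pass.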

There are some things to bear in mind, though, when using the u modifier with a regex:

- You can use Unicode code point escape sequences such as \u{1F42A} to specify characters via code points. Normal Unicode escapes such as \u03B1 only have a range of four hexadecimal digits (which equals the Basic Multilingual Plane) (source)
- "Characters of 4 bytes are handled correctly: as a single character, not two 2-byte characters" (source)
- Escaping requirements for patterns compiled with the u flag are stricter: you can't escape arbitrary characters, only those that can actually behave as special characters. See "HTML input pattern not working".
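
The three points above can be sketched as follows, assuming an ES2015+ engine. The sample character and variable names are chosen for illustration only:

```javascript
const camel = "\u{1F42A}"; // 🐪 U+1F42A, outside the Basic Multilingual Plane

// \u{...} code point escapes inside a pattern need the u flag.
console.log(/^\u{1F42A}$/u.test(camel)); // true

// "." consumes the whole code point with u, but only one UTF-16 code unit without it.
console.log(/^.$/u.test(camel)); // true
console.log(/^.$/.test(camel));  // false
console.log(camel.length);       // 2

// Escaping is stricter: \@ is tolerated without u, but is a SyntaxError with it.
console.log(new RegExp("\\@").test("@")); // true
let strictError = null;
try {
  new RegExp("\\@", "u");
} catch (e) {
  strictError = e;
}
console.log(strictError instanceof SyntaxError); // true
```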

This works in Chrome 89, though `if ('ıi和平'.match(/\p{Alphabetic}+/gu)) {console.log('true!');} else {console.log('false!');}` doesn't seem to work in Waterfox 56. Thoughts, please? — John, Apr 02 '21 at 08:31
@John If ECMAScript 2018 is not yet supported there, you will need a workaround, as described [here](https://stackoverflow.com/a/37668315/3832970). — Wiktor Stribiżew, Apr 02 '21 at 08:58
