Java - \pL [\x00-\x7F]+ regex fails to get non English characters using String.match

Question

I need to validate a name, stored in a String, which can be in any language and may contain spaces, using \p{L}:

You can match a single character belonging to the "letter" category with \p{L}

I tried to use String.matches, but it failed to match non-English characters, even for a single character, for example:

String name = "อั";
boolean isMatch = name.matches("[\\p{L}]+"); // returns false

I tried with and without brackets, and with + added for multiple letters, but it always fails to match non-English characters.

Is there an issue using String.matches with \p{L}?

I also failed using [\\x00-\\x7F]+, as suggested in the Pattern documentation:

\p{ASCII} All ASCII:[\x00-\x7F]

@CarlosHeuberger no, but even when using \pL on a one-character match, it still fails – Ori Marko Jun 02 '19 at 12:28

Wiktor Stribiżew · Accepted Answer · 2019-06-03T08:10:31.903

You should bear in mind that Java regex matches strings one Unicode code point at a time, not one user-perceived character (grapheme) at a time. \p{L} matches any single Unicode letter; it does not match the diacritics (combining marks) that are attached to a letter.

Since your input can contain letters and diacritics you should at least use both \p{L} and \p{M} Unicode property classes in your character class:

String regex = "[\\p{L}\\p{M}]+";
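As a quick check of the character class above, here is a minimal sketch that runs both the letters-only pattern from the question and the letters-plus-marks pattern against the Thai sample. The class and method names (LetterMarkDemo, lettersOnly, lettersAndMarks) are made up for illustration.

```java
import java.util.regex.Pattern;

final class LetterMarkDemo {
    // Letters only: the pattern tried in the question.
    private static final Pattern LETTERS = Pattern.compile("[\\p{L}]+");
    // Letters plus combining marks: the class suggested in this answer.
    private static final Pattern LETTERS_AND_MARKS = Pattern.compile("[\\p{L}\\p{M}]+");

    static boolean lettersOnly(String s) {
        return LETTERS.matcher(s).matches();
    }

    static boolean lettersAndMarks(String s) {
        return LETTERS_AND_MARKS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // THAI CHARACTER O ANG followed by THAI CHARACTER MAI HAN-AKAT
        String name = "\u0e2d\u0e31";
        System.out.println("letters only:      " + lettersOnly(name));
        System.out.println("letters and marks: " + lettersAndMarks(name));
    }
}
```

The first pattern rejects the sample because its second code point is a mark; the second pattern accepts it.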

If the input string can contain words separated by whitespace, you may add the \s shorthand class; to make \s match any kind of Unicode whitespace, compile the regex with the Pattern.UNICODE_CHARACTER_CLASS flag (embedded form: (?U)):

String regex = "(?U)[\\p{L}\\p{M}\\s]+";

Note that this regex allows entering diacritics, letters and whitespaces in any order. If you need a more precise regex (e.g. diacritics only allowed after a base letter) you may consider something like

String regex = "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*";

Here, (?>\\p{L}\\p{M}*+)+ matches one or more letters each followed with zero or more diacritics, \s* matches zero or more whitespaces and \s+ matches 1 or more whitespaces.
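To see how the loose and the more precise regex differ, the sketch below tries both on a few inputs, including a string where the mark comes before its base letter. The class name StrictNameRegex and the sample strings are assumptions chosen for illustration, not part of the original question.

```java
import java.util.regex.Pattern;

final class StrictNameRegex {
    // Loose form: letters, marks and whitespace in any order.
    private static final Pattern LOOSE =
            Pattern.compile("(?U)[\\p{L}\\p{M}\\s]+");
    // Strict form: each letter may be followed by marks; words are
    // separated by whitespace; at least one letter is required.
    private static final Pattern STRICT = Pattern.compile(
            "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*");

    static boolean loose(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u0e2d\u0e31",            // letter followed by mark
            "\u0e31\u0e2d",            // mark before any letter
            "Jos\u00e9 Garc\u00eda",   // precomposed accented letters
            "Jose\u0301",              // "e" plus combining acute accent
            "   "                      // whitespace only
        };
        for (String s : samples) {
            System.out.println("[" + s + "] loose=" + loose(s) + " strict=" + strict(s));
        }
    }
}
```

The strict form rejects a leading mark and a whitespace-only string, both of which the loose form accepts.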

\p{IsAlphabetic} vs. [\p{L}\p{M}]

If you check the source code, \p{Alphabetic} checks if Character.isAlphabetic(ch) is true. It is true if the char belongs to any of the following classes: UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER, LETTER_NUMBER or it has contributory property Other_Alphabetic. It is derived from Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic.

While all those L subclasses together form the general L class, note that Alphabetic also includes the letter number class Nl, and that Other_Alphabetic includes characters that are not in the \p{M} class; see this reference (although it is in German, the categories and char names are in English).

So, \p{IsAlphabetic} is broader than [\p{L}\p{M}] and you should make the right decision based on the languages you want to support.
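A small sketch of one such difference, using a letter number (Nl) character: U+2160 ROMAN NUMERAL ONE is accepted by \p{IsAlphabetic} but not by [\p{L}\p{M}]. The class and method names are hypothetical, and the example does not try to enumerate every difference between the two classes.

```java
import java.util.regex.Pattern;

final class AlphabeticVsLetterMark {
    private static final Pattern ALPHABETIC = Pattern.compile("\\p{IsAlphabetic}+");
    private static final Pattern LETTER_OR_MARK = Pattern.compile("[\\p{L}\\p{M}]+");

    static boolean isAlphabetic(String s) {
        return ALPHABETIC.matcher(s).matches();
    }

    static boolean isLetterOrMark(String s) {
        return LETTER_OR_MARK.matcher(s).matches();
    }

    public static void main(String[] args) {
        String romanOne = "\u2160";    // ROMAN NUMERAL ONE, category Nl
        String thai = "\u0e2d\u0e31";  // the sample from the question
        System.out.println("U+2160 alphabetic:     " + isAlphabetic(romanOne));
        System.out.println("U+2160 letter or mark: " + isLetterOrMark(romanOne));
        System.out.println("Thai alphabetic:       " + isAlphabetic(thai));
        System.out.println("Thai letter or mark:   " + isLetterOrMark(thai));
    }
}
```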

Thank you, why is it better than `\p{IsAlphabetic}`? – Ori Marko Jun 03 '19 at 03:05
@user7294900 I added some more details. – Wiktor Stribiżew Jun 03 '19 at 08:10

Ori Marko · Answer 2 · 2019-06-02T12:54:02.983

The only solution I found is using \p{IsAlphabetic}

\p{Alpha} An alphabetic character:\p{IsAlphabetic}

boolean isMatch = name.matches("[ \\p{IsAlphabetic}]+");

This does not work on sites such as https://regex101.com/ (see demo).

edited Jun 02 '19 at 12:54

answered Jun 02 '19 at 12:32


ICE · Answer 3 · 2019-06-02T13:50:38.567

I googled that character to find the language; it seems to be Thai. The Thai Unicode block is U+0E00 to U+0E7F.

When you are working with Unicode characters you can use \u escapes. So, the regex should look like this:

[\u0E00-\u0E7F]

This matches your character in this regex test.

If you want to match letters from any language, use this:

[\p{L}]

This matches your example characters in this regex test.
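In Java the block range can be written with escaped backslashes so that the regex engine, not the compiler, interprets \u0E00 and \u0E7F. This is only a sketch with a made-up class name (ThaiBlockDemo), and it accepts Thai-block characters only, so it would not cover names in other scripts.

```java
import java.util.regex.Pattern;

final class ThaiBlockDemo {
    // Any run of code points inside the Thai block U+0E00..U+0E7F.
    private static final Pattern THAI = Pattern.compile("[\\u0E00-\\u0E7F]+");

    static boolean isThaiOnly(String s) {
        return THAI.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isThaiOnly("\u0e2d\u0e31")); // sample from the question
        System.out.println(isThaiOnly("abc"));          // Latin letters
    }
}
```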

edited Jun 02 '19 at 13:50

answered Jun 02 '19 at 13:19


I need English and non English characters, not only Thai, but thanks for the Thai reference – Ori Marko Jun 02 '19 at 13:22

Mike Samuel · Answer 4 · 2019-06-02T13:43:52.973

There are two characters there. The first is a letter, the second is a non-letter mark.

String name = "\u0e2d";
boolean isMatch = name.matches("[\\p{L}]+"); // true

works, but

String name = "\u0e2d\u0e31";
boolean isMatch = name.matches("[\\p{L}]+"); // false

does not, because ั (U+0E31) is a Non-Spacing Mark [NSM], not a letter.
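To check which general category each code point of an input has, something like the following sketch can help. It relies on Character.getType; the class name CodePointCategories and the short labels it prints are assumptions made for this example, and only the letter and mark categories are spelled out.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointCategories {
    // Maps the general category of a code point to its two-letter
    // abbreviation; anything that is not a letter or mark is "other".
    static String categoryName(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:       return "Lu";
            case Character.LOWERCASE_LETTER:       return "Ll";
            case Character.TITLECASE_LETTER:       return "Lt";
            case Character.MODIFIER_LETTER:        return "Lm";
            case Character.OTHER_LETTER:           return "Lo";
            case Character.NON_SPACING_MARK:       return "Mn";
            case Character.COMBINING_SPACING_MARK: return "Mc";
            case Character.ENCLOSING_MARK:         return "Me";
            default:                               return "other";
        }
    }

    // One entry per code point, e.g. "U+0E2D Lo".
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp ->
                out.add(String.format("U+%04X %s", cp, categoryName(cp))));
        return out;
    }

    public static void main(String[] args) {
        describe("\u0e2d\u0e31").forEach(System.out::println);
    }
}
```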

edited Jun 02 '19 at 13:43

answered Jun 02 '19 at 13:26


I'll check it, but it's also from user input and also why/how `IsAlphabetic` is working in this case? – Ori Marko Jun 02 '19 at 13:37
@user7294900. Sorry. My mistake. I'm seeing a diacritical now ั [U+E31](http://www.fileformat.info/info/unicode/char/e31/index.htm) – Mike Samuel Jun 02 '19 at 13:41
The logical conclusion being that `"[\\p{L}\\p{M}]+"` will correctly match that string. – VGR Jun 02 '19 at 15:51

score 1 · Answer 5 · answered Jun 02 '19 at 14:01

Try including more categories:

[\p{L}\p{Mn}\p{Mc}\p{Nl}\p{Pc}\p{Pd}\p{Po}\p{Sk}]+

Note that it might be best to simply not validate names. People can't really complain if they entered it wrong but your system didn't catch it. However, it's much more of a problem if someone is unable to enter their name. If you do insist on adding validation, please make it overridable: that should have the advantages of each method without their disadvantages.
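One possible reading of "overridable" is sketched below: the regex only decides whether to ask the user for confirmation, and a confirmed name is always accepted. The class OverridableNameCheck, the accept method and the userConfirmed parameter are hypothetical names for illustration; the pattern is the one shown above.

```java
import java.util.regex.Pattern;

final class OverridableNameCheck {
    // Broad class from this answer: letters, marks, letter numbers and
    // several punctuation/symbol categories (apostrophe, hyphen, ...).
    private static final Pattern BROAD = Pattern.compile(
            "[\\p{L}\\p{Mn}\\p{Mc}\\p{Nl}\\p{Pc}\\p{Pd}\\p{Po}\\p{Sk}]+");

    static boolean looksLikeName(String name) {
        return BROAD.matcher(name).matches();
    }

    // A name is accepted if it passes the check, or if the user has
    // explicitly confirmed it after being warned.
    static boolean accept(String name, boolean userConfirmed) {
        return userConfirmed || looksLikeName(name);
    }

    public static void main(String[] args) {
        System.out.println(accept("O'Brien", false));
        System.out.println(accept("Jean-Luc", false));
        System.out.println(accept("R2D2", false)); // digits are not in the class
        System.out.println(accept("R2D2", true));  // user override
    }
}
```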

answered Jun 02 '19 at 14:01

Solomon Ucko


I must validate user input, can you explain the categories? Can you add reference link/demo? – Ori Marko Jun 02 '19 at 14:03
@user7294900 I used https://en.wikipedia.org/wiki/Unicode_character_property and https://www.compart.com/en/unicode/category to look up the categories. – Solomon Ucko Jun 02 '19 at 14:04
Thank you for responding, it works, but it's adding a lot of questions, for example why adding *\p{Pd} matches any kind of hyphen or dash \p{Po} matches any kind of punctuation character that is not a dash, bracket, quote or connector*? – Ori Marko Jun 03 '19 at 05:03
@user7294900 If you're asking why the Unicode standard was designed the way it was, I don't have any answers. I also think it's a mess. If you're asking something, could you please clarify your question? – Solomon Ucko Jun 03 '19 at 10:37
