Bear in mind that Java regex matches individual Unicode code points, not user-perceived characters (grapheme clusters). \p{L}
matches a single letter code point, so it does not match the combining diacritics that may follow a base letter.
Since your input can contain letters and diacritics you should at least use both \p{L}
and \p{M}
Unicode property classes in your character class:
String regex = "[\\p{L}\\p{M}]+";
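To see why the marks matter, here is a minimal sketch, assuming the input may arrive in decomposed (NFD) form. The class name LetterMarkDemo, its helper methods and the sample word are illustrative only:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class LetterMarkDemo {
    // Letters only: rejects combining marks.
    static final Pattern LETTERS_ONLY = Pattern.compile("\\p{L}+");
    // Letters and combining marks, as suggested above.
    static final Pattern LETTERS_AND_MARKS = Pattern.compile("[\\p{L}\\p{M}]+");

    static boolean lettersOnly(String s) {
        return LETTERS_ONLY.matcher(s).matches();
    }

    static boolean lettersAndMarks(String s) {
        return LETTERS_AND_MARKS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String composed = "\u00E9t\u00E9"; // precomposed e-acute, t, e-acute
        // NFD splits each e-acute into 'e' + U+0301 COMBINING ACUTE ACCENT.
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(lettersOnly(composed));       // true
        System.out.println(lettersOnly(decomposed));     // false
        System.out.println(lettersAndMarks(decomposed)); // true
    }
}
```

Normalizing the input to NFC first would hide the problem for this word, but not every base letter and mark combination has a precomposed form, so including \p{M} is the safer option.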
If the input string can contain words separated by whitespace, you may add the \s
shorthand class. To make it match any kind of Unicode whitespace, compile the regex with the Pattern.UNICODE_CHARACTER_CLASS
flag, or use its embedded form, (?U):
String regex = "(?U)[\\p{L}\\p{M}\\s]+";
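The effect of the flag can be checked with a short sketch like the one below. The class name UnicodeWhitespaceDemo and the sample input with a no-break space (U+00A0) are assumptions made for illustration:

```java
import java.util.regex.Pattern;

class UnicodeWhitespaceDemo {
    // Without the flag, \s only covers space, \t, \n, \x0B, \f and \r.
    static final Pattern DEFAULT_WS = Pattern.compile("[\\p{L}\\p{M}\\s]+");
    // With the flag, \s follows the Unicode White_Space property.
    static final Pattern UNICODE_WS =
            Pattern.compile("[\\p{L}\\p{M}\\s]+", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matchesDefault(String s) {
        return DEFAULT_WS.matcher(s).matches();
    }

    static boolean matchesUnicode(String s) {
        return UNICODE_WS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String withNbsp = "a\u00A0b"; // words separated by NO-BREAK SPACE
        System.out.println(matchesDefault("a b"));    // true
        System.out.println(matchesDefault(withNbsp)); // false
        System.out.println(matchesUnicode(withNbsp)); // true
    }
}
```

Passing the flag to Pattern.compile and prefixing the pattern with (?U) are equivalent; the inline form is handy when the regex is used through String.matches.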
Note that this regex allows diacritics, letters and whitespace in any order. If you need a more precise regex (e.g. diacritics allowed only after a base letter), you may consider something like
String regex = "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*";
Here, (?>\p{L}\p{M}*+)+
matches one or more letters, each followed by zero or more diacritics, \s*
matches zero or more whitespace characters and \s+
matches one or more whitespace characters.
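A quick way to compare the stricter pattern with the permissive character class is sketched below. The class name StrictWordsDemo, the method names and the sample inputs are hypothetical:

```java
import java.util.regex.Pattern;

class StrictWordsDemo {
    // Words made of base letters, each optionally followed by marks,
    // separated by whitespace; surrounding whitespace is allowed.
    static final Pattern STRICT = Pattern.compile(
            "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*");
    // Letters, marks and whitespace in any order.
    static final Pattern LOOSE = Pattern.compile("(?U)[\\p{L}\\p{M}\\s]+");

    static boolean isStrict(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean isLoose(String s) {
        return LOOSE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String ok = "Jose\u0301 Maria";  // mark follows its base letter
        String orphan = "\u0301Jose";    // mark with no base letter before it

        System.out.println(isStrict(ok));     // true
        System.out.println(isStrict(orphan)); // false
        System.out.println(isLoose(orphan));  // true
        System.out.println(isStrict("   "));  // false: at least one letter is required
    }
}
```

The atomic group and the possessive quantifier keep the strict pattern from backtracking into a letter's marks, so invalid input is rejected quickly.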
\p{IsAlphabetic}
vs. [\p{L}\p{M}]
If you check the source code, \p{IsAlphabetic}
checks if Character.isAlphabetic(ch)
is true. It is true if the char belongs to any of the following classes: UPPERCASE_LETTER
, LOWERCASE_LETTER
, TITLECASE_LETTER
, MODIFIER_LETTER
, OTHER_LETTER
, LETTER_NUMBER
or it has contributory property Other_Alphabetic. It is derived from Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic
.
While all those L
subclasses form the general L
class, note that \p{IsAlphabetic}
also includes the letter number Nl
class, and Other_Alphabetic includes characters that are outside the \p{M}
class; see this reference (although it is in German, the categories and character names are in English).
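So, \p{IsAlphabetic}
is not interchangeable with [\p{L}\p{M}]
: it additionally matches characters such as letter numbers, but it does not match every combining mark (U+0301 COMBINING ACUTE ACCENT, for example, is not alphabetic). You should make the decision based on the languages you want to support.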
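If you want to probe how the two classes treat a particular code point on your JDK, a small sketch along these lines may help. The class name AlphabeticDemo, its helpers and the chosen code points are illustrative assumptions:

```java
import java.util.regex.Pattern;

class AlphabeticDemo {
    static final Pattern ALPHABETIC = Pattern.compile("\\p{IsAlphabetic}");
    static final Pattern LETTER_OR_MARK = Pattern.compile("[\\p{L}\\p{M}]");

    // Turns a code point into a String so supplementary characters work too.
    static String str(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    static boolean isAlphabetic(int codePoint) {
        return ALPHABETIC.matcher(str(codePoint)).matches();
    }

    static boolean isLetterOrMark(int codePoint) {
        return LETTER_OR_MARK.matcher(str(codePoint)).matches();
    }

    public static void main(String[] args) {
        int romanOne = 0x2160;    // ROMAN NUMERAL ONE, category Nl
        int acuteAccent = 0x0301; // COMBINING ACUTE ACCENT, category Mn

        System.out.println(isAlphabetic(romanOne));      // true
        System.out.println(isLetterOrMark(romanOne));    // false
        System.out.println(isAlphabetic(acuteAccent));   // false
        System.out.println(isLetterOrMark(acuteAccent)); // true
    }
}
```

Running such a probe over the characters you expect in your input is a cheap way to decide which class fits.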