Regex matching letter characters

Question

I have this regex:

if (cadena.matches("^[a-zA-Z ]+$")) return true;

It accepts the letters A to Z in lowercase and uppercase, and it also accepts spaces.

But this only works for English. For instance, in Catalan we have the 'ç' character, as well as accented characters such as 'á' and 'à'.

I searched on Google but couldn't find a way to do this.

I found out that I can filter for UTF-8, but this would accept characters that are not really letters.

How can I implement this?

Take a look at [Unicode blocks](http://jregex.sourceforge.net/gstarted.html#appendix-c). — Linus Kleen, Jun 07 '13 at 09:38
Dunno if this helps: http://stackoverflow.com/questions/896374/what-is-the-regular-expression-for-a-spanish-word more relevant: http://stackoverflow.com/questions/6548815/how-do-i-match-latin-unicode-characters-in-coldfusion-or-java-regex?rq=1 — wazy, Jun 07 '13 at 09:40
Also have a look at [stackoverflow.com/questions/9499851/...](http://stackoverflow.com/questions/9499851/regex-for-validating-alphabetics-and-numbers-in-the-localized-string/9500409#9500409) — stema, Jun 07 '13 at 10:27

score 26 · Accepted Answer · answered Jun 07 '13 at 09:42

Use this regex:

[\p{L}\s]+

\p{L} matches any Unicode letter, in any script; \s matches whitespace.
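A minimal sketch of how this pattern might be used from Java, assuming the whole input should consist of letters and whitespace. The class name `UnicodeLetterCheck` and the helper `isLettersAndSpaces` are hypothetical names chosen for illustration; the test strings use Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

final class UnicodeLetterCheck {
    // \p{L} matches any Unicode letter; \s matches whitespace (space, tab, newline, ...).
    // The backslashes are doubled because this is a Java string literal.
    private static final Pattern LETTERS_AND_SPACES = Pattern.compile("[\\p{L}\\s]+");

    // Hypothetical helper name, not taken from the question.
    static boolean isLettersAndSpaces(String cadena) {
        // matches() requires the entire input to match, so ^ and $ are not needed.
        return LETTERS_AND_SPACES.matcher(cadena).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLettersAndSpaces("Bar\u00e7a i Catal\u00e0")); // true: Catalan letters and spaces
        System.out.println(isLettersAndSpaces("abc123"));                   // false: digits are not letters
        System.out.println(isLettersAndSpaces("\uC548\uB155"));             // true: Korean letters are letters too
    }
}
```

Note that `\s` also accepts tabs and newlines. If only the space character should be allowed, as in the original `[a-zA-Z ]` pattern, a literal space can be used in the character class instead.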

mvp

Doesn't this match non-Latin characters as well, which is not exactly what the OP was looking for (even though they did accept this answer)? It matches `안녕`, for example. It seems like `\p{IsLatin}` is a better fit if you specifically want to match Latin characters ([ref](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)). – Nick Chammas Nov 20 '16 at 03:05
@NickChammas: op explicitly wanted any Unicode letters matched. – mvp Nov 20 '16 at 03:31
At this point 3 years after the fact I suppose it's a moot point but the OP's title and intended use case of matching the Catalan alphabet suggest they want to just match Latin characters and not all Unicode (which would include other alphabets, like Korean). I don't see where the OP explicitly wanted to match "any Unicode letter". But anyway, I upvoted this answer because it was helpful. I hope my earlier comment helps others who, like me, came to this page looking for a way to match just Latin characters and not any Unicode. – Nick Chammas Nov 20 '16 at 16:43
To elaborate a bit, in case this is a point of confusion, Latin != ASCII. Most Latin characters, like `ë`, `ɶ`, or `ṧ`, can only be [represented by Unicode](https://en.wikipedia.org/wiki/Latin_script_in_Unicode). `\p{IsLatin}` will match those characters without matching characters from other, non-Latin alphabets. – Nick Chammas Nov 20 '16 at 17:42
best solution – yildirimosman Apr 12 '20 at 22:56
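The `\p{IsLatin}` suggestion from the comments above can be sketched as follows. This assumes Java 7 or later, where `Pattern` supports script properties; the class name `LatinScriptCheck` and the helper `isLatinText` are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

final class LatinScriptCheck {
    // \p{IsLatin} matches characters whose Unicode script is Latin,
    // which covers accented letters far beyond ASCII.
    private static final Pattern LATIN_AND_SPACES = Pattern.compile("[\\p{IsLatin}\\s]+");

    // Hypothetical helper name, not taken from the thread.
    static boolean isLatinText(String text) {
        return LATIN_AND_SPACES.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatinText("Bar\u00e7a"));    // true: c with cedilla is Latin script
        System.out.println(isLatinText("\u1E67"));        // true: s with caron and dot above is Latin script
        System.out.println(isLatinText("\uC548\uB155"));  // false: Hangul is not Latin script
    }
}
```

Compared with `\p{L}`, this rejects letters from other scripts such as Korean while still accepting Catalan text.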

Uwe Plonus · Answer 2 · 2019-03-29T18:15:51.557

-2

Look at the documentation and use a Unicode block class (e.g. \p{InLATIN_1_SUPPLEMENT}).

edited Mar 29 '19 at 18:15

answered Jun 07 '13 at 09:42

Uwe Plonus

This documentation page does NOT have `Latin1Supplemental` mentioned anywhere. Even googling for `Latin1Supplemental` at `site:oracle.com` does not find it. What gives? – mvp Jun 07 '13 at 09:52
It could be named something different. Please check the documentation for `Character.UnicodeBlock`. There is a constant named `LATIN_1_SUPPLEMENT` whose name could be used for the `\p{}` name. – Uwe Plonus Jun 07 '13 at 09:56
This should be: `Pattern.compile("\\p{InLATIN_1_SUPPLEMENT}")`. Mind the `In` preceding the `Character.UnicodeBlock` constant. From "Mastering Regular Expressions": "Unicode blocks are supported, requiring an ‘In’ prefix." – Stefan van den Akker Mar 28 '19 at 10:30
@StefanvandenAkker You are right. I corrected my answer. – Uwe Plonus Mar 28 '19 at 18:40
@UwePlonus Sorry, that still doesn't compile. It should be either `\p{InLATIN_1_SUPPLEMENT}`, `\p{InLATIN-1 SUPPLEMENT}` or `\p{InLATIN-1SUPPLEMENT}` as per the `idName` and `aliases` taken from `Character.UnicodeBlock.LATIN_1_SUPPLEMENT`. – Stefan van den Akker Mar 29 '19 at 15:37
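A short sketch of the block syntax discussed in these comments, showing what `\p{InLATIN_1_SUPPLEMENT}` does and does not cover. The class name `Latin1BlockCheck` and the helper `inLatin1Supplement` are hypothetical. The Latin-1 Supplement block spans U+0080 to U+00FF, so it excludes plain ASCII letters and includes some symbols that are not letters.

```java
import java.util.regex.Pattern;

final class Latin1BlockCheck {
    // "In" + block name selects a Unicode block, here Latin-1 Supplement (U+0080..U+00FF).
    private static final Pattern LATIN_1_SUPPLEMENT =
            Pattern.compile("\\p{InLATIN_1_SUPPLEMENT}");

    // Hypothetical helper: true if the input is exactly one character from the block.
    static boolean inLatin1Supplement(String oneChar) {
        return LATIN_1_SUPPLEMENT.matcher(oneChar).matches();
    }

    public static void main(String[] args) {
        System.out.println(inLatin1Supplement("\u00e7")); // true: c with cedilla (U+00E7)
        System.out.println(inLatin1Supplement("a"));      // false: ASCII letters are in Basic Latin
        System.out.println(inLatin1Supplement("\u00d7")); // true: multiplication sign, not a letter
    }
}
```

Because a block is a code point range and not a letter category, it would need to be combined with other classes (for example `a-zA-Z`) and still would not filter out symbols, which may be why the other approaches in this thread were preferred.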
