Regular expression to catch letters beyond a-z

Question

A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.

Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

score 14 · Accepted Answer · answered Mar 17 '09 at 21:51

You can use \pL (written with braces, \p{L}, in .NET) to match any 'letter', which covers the letters of all languages. You can narrow it down to specific scripts using 'named blocks'. More information can be found in the Character Classes documentation on MSDN.
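As a rough illustration of the idea, here is a minimal sketch in Java rather than .NET (an assumption made for this example). Java's `java.util.regex` accepts the same `\p{L}` syntax, but it spells a named block `\p{InGreek}` where .NET writes `\p{IsGreek}`. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
final class LetterMatcher {
    // \p{L} matches any code point in the Unicode "Letter" general category.
    private static final Pattern LETTERS_ONLY = Pattern.compile("\\p{L}+");
    // Named block, Java spelling (.NET would use IsGreek instead of InGreek).
    private static final Pattern GREEK_BLOCK_ONLY = Pattern.compile("\\p{InGreek}+");

    static boolean isLettersOnly(String s) {
        return LETTERS_ONLY.matcher(s).matches();
    }

    static boolean isGreekBlockOnly(String s) {
        return GREEK_BLOCK_ONLY.matcher(s).matches();
    }

    public static void main(String[] args) {
        // A-ring, a-diaeresis, o-diaeresis, written as escapes to keep the source ASCII.
        System.out.println(isLettersOnly("\u00C5\u00E4\u00F6"));    // true
        System.out.println(isLettersOnly("abc123"));                // false
        // Greek alpha, beta, gamma.
        System.out.println(isGreekBlockOnly("\u03B1\u03B2\u03B3")); // true
        System.out.println(isGreekBlockOnly("abc"));                // false
    }
}
```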

My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
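The localised-resource approach could look roughly like the following Java sketch. Everything here is an assumption for illustration: the in-memory map stands in for a real resource file, the two entries are examples only, and the fallback to `\p{L}` is one possible design choice rather than something the answer prescribes.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: the map stands in for a per-locale resource file.
final class LocalizedLetters {
    // Illustrative entries only; a real resource would be maintained per locale.
    private static final Map<String, String> LETTER_CLASS_BY_LANGUAGE = Map.of(
            "en", "a-zA-Z",
            // Swedish: a-z plus a-ring, a-diaeresis, o-diaeresis in both cases.
            "sv", "a-zA-Z\\u00E5\\u00E4\\u00F6\\u00C5\\u00C4\\u00D6");

    // Fallback when no entry exists for the language: any Unicode letter.
    private static final String DEFAULT_LETTER_CLASS = "\\p{L}";

    // Pulls the "letter" part for the locale and forms it into a larger pattern.
    static Pattern wordPattern(Locale locale) {
        String letters = LETTER_CLASS_BY_LANGUAGE.getOrDefault(
                locale.getLanguage(), DEFAULT_LETTER_CLASS);
        return Pattern.compile("^[" + letters + "]+$");
    }

    public static void main(String[] args) {
        // "smorgas" spelled with o-diaeresis and a-ring.
        String word = "sm\u00F6rg\u00E5s";
        System.out.println(wordPattern(Locale.forLanguageTag("sv-SE")).matcher(word).matches()); // true
        System.out.println(wordPattern(Locale.ENGLISH).matcher(word).matches());                 // false
    }
}
```

Keeping only the character class in the resource, rather than the whole expression, means the surrounding pattern is written once and shared by every locale.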

Richard Szalay

For those who are not so familiar with regex (like me), the actual correct code is: \p{Ll} – Run CMD Feb 11 '10 at 15:29
To match letters use `\p{L}`. To match letters with diacritics, use `(?>\p{L}\p{M}*)`. To match uppercase letters, use `\p{Lu}`. To match lowercase letters - yes - use `\p{Ll}`. – Wiktor Stribiżew Jan 26 '16 at 10:35
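The difference between these patterns shows up with decomposed text, where an accented letter is stored as a base letter plus a combining mark. Here is a small Java sketch of that case (Java is an assumption for this example; the four patterns are the ones quoted in the comment above, and the class name is hypothetical).

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the patterns are the ones quoted in the comment.
final class LetterCategories {
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");
    // A letter followed by any number of combining marks, repeated.
    static final Pattern LETTERS_WITH_MARKS = Pattern.compile("(?>\\p{L}\\p{M}*)+");
    static final Pattern UPPERCASE = Pattern.compile("\\p{Lu}+");
    static final Pattern LOWERCASE = Pattern.compile("\\p{Ll}+");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "a" followed by U+0308 COMBINING DIAERESIS: one visible letter, two code points.
        String decomposed = "a\u0308";
        System.out.println(matches(LETTERS, decomposed));             // false
        System.out.println(matches(LETTERS_WITH_MARKS, decomposed));  // true
        // Uppercase A-ring, A-diaeresis, O-diaeresis.
        System.out.println(matches(UPPERCASE, "\u00C5\u00C4\u00D6")); // true
        System.out.println(matches(LOWERCASE, "\u00C5\u00C4\u00D6")); // false
    }
}
```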

score 3 · Answer 2 · answered Mar 17 '09 at 21:47

What about \p{name} ?

Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.

I don't know enough about Unicode, but maybe your characters fit a Unicode class?

score 2 · Answer 3 · answered Mar 17 '09 at 21:50

See character category selection with \p, and the Unicode semantics of \w.

MarkusQ

score 0 · Answer 4 · answered May 31 '16 at 11:20

This regex allows only valid symbols through:

[a-zA-ZÀ-ÿ ]
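One thing worth checking with a fixed range like this: À-ÿ spans U+00C0 to U+00FF, which includes the multiplication sign (U+00D7) and division sign (U+00F7), and excludes letters outside Latin-1 such as the Polish ł (U+0142). The Java sketch below (Java and the class name are assumptions for this example) compares the range against `\p{L}`.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing a fixed Latin-1 range with \p{L}.
final class LatinRangeCheck {
    // The range from the answer, written with escapes: U+00C0 through U+00FF, plus space.
    static final Pattern RANGE = Pattern.compile("[a-zA-Z\\u00C0-\\u00FF ]+");
    static final Pattern ANY_LETTER = Pattern.compile("\\p{L}+");

    static boolean accepts(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String multiplicationSign = "\u00D7"; // a symbol, not a letter
        String polishL = "\u0142";            // l with stroke, a letter outside Latin-1
        System.out.println(accepts(RANGE, multiplicationSign));      // true
        System.out.println(accepts(ANY_LETTER, multiplicationSign)); // false
        System.out.println(accepts(RANGE, polishL));                 // false
        System.out.println(accepts(ANY_LETTER, polishL));            // true
    }
}
```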

John Wakefield

score 0 · Answer 5 · answered Mar 17 '09 at 21:46

All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.

The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."

Jason Cohen

I suggested [:alpha:] in an answer I have deleted. I don't know C#, so I am probably wrong, but the regex engines I'm familiar with change the letters they match based on locale. – Jon 'links in bio' Ericson Mar 17 '09 at 21:52
@Jon: .NET does not support [:name:] for named classes, but has alternate syntax for the same purpose. – Richard Mar 18 '09 at 11:27
@Jason: You would only need to list them if your definition of letter differed from Unicode's and Character Class Subtraction was insufficient, e.g. [\p{L}-[\p{IsBasicLatin}]] would match all non-ASCII letters. – Richard Mar 18 '09 at 11:29
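For comparison, a hedged sketch of the same "letters minus Basic Latin" idea in Java (an assumption for this example, since the comment above uses .NET syntax). Java has no `[a-[b]]` subtraction, so the sketch uses intersection with a negated class, and the block is spelled `InBasicLatin` rather than `IsBasicLatin`. The class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; approximates the .NET subtraction from the comment.
final class NonAsciiLetters {
    // Letters that are NOT in the Basic Latin block (U+0000 through U+007F).
    static final Pattern NON_ASCII_LETTER =
            Pattern.compile("[\\p{L}&&[^\\p{InBasicLatin}]]");

    static boolean isNonAsciiLetter(String s) {
        return NON_ASCII_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNonAsciiLetter("\u00E5")); // a-ring: true
        System.out.println(isNonAsciiLetter("a"));      // ASCII letter: false
        System.out.println(isNonAsciiLetter("\u00D7")); // multiplication sign: false
    }
}
```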

score 0 · Answer 6 · answered Mar 18 '09 at 11:38

Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

This is not, in general, possible.

After all, English text does include some accented characters (e.g. in "fête" and "naïve", which in UK English still take accents to be strictly correct). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).

Then consider that foreign words may be included (this will often be the case where technical terms are used). Quotations would be another source.

If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic experience in that language.
