11

A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.

Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

Chad Birch
Nifle

6 Answers

14

You can use \p{L} to match any 'letter', which will support all letters in all languages (.NET requires the braces, so the \pL shorthand known from other engines is not accepted). You can narrow it down to specific languages using 'named blocks'. More information can be found in the Character Classes documentation on MSDN.

My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
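
To illustrate the "any letter" idea outside .NET, here is a minimal sketch in Python. Python's built-in `re` module does not support `\p{L}`, so the sketch checks the Unicode general category directly via `unicodedata`. The helper names `is_letter` and `letters_only` are made up for this example.

```python
import unicodedata

def is_letter(ch):
    # General categories Lu, Ll, Lt, Lm and Lo all start with "L",
    # which is the set that \p{L} covers.
    return unicodedata.category(ch).startswith("L")

def letters_only(text):
    # Approximates matching the whole string against \p{L}+ :
    # at least one character, and every character is a letter.
    return bool(text) and all(is_letter(ch) for ch in text)

print(letters_only("smörgåsbord"))  # Swedish letters are accepted
print(letters_only("abc123"))       # digits are not letters
```

No per-language character list is needed here, because the Unicode character database already records which code points are letters.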

Richard Szalay
  • For those who are not so familiar with regex (like me), the actual correct code is: \p{Ll} – Run CMD Feb 11 '10 at 15:29
  • To match letters use `\p{L}`. To match letters with diacritics, use `(?>\p{L}\p{M}*)`. To match uppercase letters, use `\p{Lu}`. To match lowercase letters - yes - use `\p{Ll}`. – Wiktor Stribiżew Jan 26 '16 at 10:35
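
The "letter plus combining marks" pattern from the comment above can be sketched in Python as follows. This is only an approximation under stated assumptions: the standard `re` module has no `\p{...}` classes, so the general category is read from `unicodedata`, and `letters_with_marks` is a hypothetical helper name.

```python
import unicodedata

def letters_with_marks(text):
    # Approximates (?:\p{L}\p{M}*)+ over the whole string: every character
    # is a letter or a combining mark, and a mark may only appear once a
    # base letter has been seen.
    seen_letter = False
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L"):
            seen_letter = True
        elif category.startswith("M") and seen_letter:
            continue
        else:
            return False
    return seen_letter

# NFD splits a precomposed letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e5")  # "a" + U+030A
print(len(decomposed), letters_with_marks(decomposed))

# \p{Lu} and \p{Ll} correspond to the "Lu" and "Ll" categories.
print(unicodedata.category("\u00c5"), unicodedata.category("\u00e5"))
```

Allowing marks matters when input arrives in decomposed form, where "å" is two code points and the second one is not a letter on its own.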
3

What about \p{name}?

Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.

I don't know enough about Unicode, but maybe your characters fit a Unicode class?

Ray
2

See character category selection with \p, and the Unicode semantics of \w.

MarkusQ
0

This regex allows only valid symbols through:

[a-zA-ZÀ-ÿ ]
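
A quick way to see what a range like this actually admits is to enumerate it. The sketch below, in Python, writes the same class with `\u` escapes (À is U+00C0, ÿ is U+00FF) and lists every accepted character that is neither a letter nor the space. The variable names are arbitrary.

```python
import re
import unicodedata

# Same class as [a-zA-ZÀ-ÿ ], spelled with escapes for the range endpoints.
pattern = re.compile(r"[a-zA-Z\u00C0-\u00FF ]")

# Accepted characters that are not letters (and not the space itself).
extras = [
    chr(cp)
    for cp in range(0x250)
    if pattern.fullmatch(chr(cp))
    and chr(cp) != " "
    and not unicodedata.category(chr(cp)).startswith("L")
]
print(extras)  # ['×', '÷']

# Letters beyond U+00FF, such as Polish "ł" (U+0142), are rejected.
print(pattern.fullmatch("\u0142"))  # None
```

So the range lets the multiplication and division signs through and still misses letters outside Latin-1, which is worth weighing before relying on it.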
John Wakefield
0

All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.

The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."

Jason Cohen
  • I suggested [:alpha:] in an answer I have deleted. I don't know C#, so I am probably wrong, but the regex engines I'm familiar with change the letters they match based on locale. – Jon 'links in bio' Ericson Mar 17 '09 at 21:52
  • @Jon: .NET does not support [:name:] for named classes, but has an alternate syntax for the same purpose. – Richard Mar 18 '09 at 11:27
  • @Jason: You would only need to list them if your definition of letter differed from Unicode's and Character Class Subtraction was insufficient, e.g. [\p{L}-[\p{IsBasicLatin}]] would match all non-ASCII letters. – Richard Mar 18 '09 at 11:29
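
The subtraction idea in the last comment (letters, minus the Basic Latin block) can be mimicked in Python as a rough sketch. Python's `re` has neither `\p{L}` nor class subtraction, so this combines a general-category check with a code point test, assuming Basic Latin means U+0000 to U+007F. `is_non_ascii_letter` is a hypothetical name.

```python
import unicodedata

def is_non_ascii_letter(ch):
    # "Letter" category, with the Basic Latin block (U+0000 to U+007F)
    # subtracted, similar in spirit to [\p{L}-[\p{IsBasicLatin}]].
    return unicodedata.category(ch).startswith("L") and ord(ch) > 0x7F

sample = "Malm\u00f6, \u00c5re"  # "Malmö, Åre"
print([ch for ch in sample if is_non_ascii_letter(ch)])
```

The same two-condition shape works for other subtractions: start from the broad Unicode definition and remove the part you do not want.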
0

Is there a way to automatically know what chars are valid in a given locale/language or should I just make a blacklist of chars that I (think I) know I don't want?

This is not, in general, possible.

After all, English text does include some accented characters (e.g. "fête" and "naïve", which in UK English, to be strictly correct, still use accents). In some languages, some of the standard letters are rarely used (e.g. y-diaeresis in French).

Then consider that foreign words may be included (this will often be the case where technical terms are used). Quotations would be another source.

If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic experience in that language.

Richard