I am trying to check whether a string consists only of letters. All kinds of letters should be allowed: the typical a-zA-Z, but also áàéèó and so on.

I tried to match it with the following regex: `([\S])*`

But this also allows characters like `\/<>*()`. Those obviously don't belong in a name. What does the regex look like if I only want to allow letters and 'special' letters?

Vivendi
  • possible duplicate of [Regex white list for input validation - accent insensitive](http://stackoverflow.com/questions/5665570/regex-white-list-for-input-validation-accent-insensitive) – CodeCaster Feb 28 '13 at 09:17
  • Because if it works in C# _and_ Javascript, it doesn't work for C# only? But OK, then this one: [Regex accent insensitive?](http://stackoverflow.com/questions/6664582/regex-accent-insensitive), which also says "Use \w+". – CodeCaster Feb 28 '13 at 09:22
  • `All kinds of letters should be allowed`: Does this mean you also want Chinese, Korean, Thai, etc. characters to be allowed? – nhahtdh Feb 28 '13 at 09:26
  • CodeCaster, `\w` is horrible for almost all real-world uses. It allows letters as well as digits and the underscore, and in many regex engines it's not Unicode-enabled and really matches only ASCII. It was meant as a crude shortcut for matching identifiers in common programming languages three decades ago (guessed); it's a poor and nigh-useless choice for processing actual text. And, being based on `\w`, `\b` falls in the same category of almost useless. – Joey Feb 28 '13 at 09:32
  • CodeCaster, I take that earlier comment back. They actually need a regex that works in both C# and JavaScript, but it wasn't apparent from the question (or they didn't even know at the time). – Joey Feb 28 '13 at 09:51

2 Answers

For a non-regex solution, you can use `char.IsLetter`:

Char.IsLetter Method

Indicates whether the specified Unicode character is categorized as an alphabetic letter.

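    using System.Linq;  // provides the All() extension method
    string str = "Abcáàéèó";
    bool result = str.All(char.IsLetter);  // true only if every char is a Unicode letter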

This returns false for strings containing digits or characters such as `\/<>*()`.
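
As a rough sketch of how this check behaves on a few inputs, the snippet below wraps it in a helper. The name `IsAllLetters` is hypothetical and not from the answer. It is written with C# top-level statements, so it assumes a C# 9 or later compiler. The explicit empty-string guard is an added assumption: `Enumerable.All` returns true for an empty sequence, which is probably not what a name validator wants.

```csharp
using System;
using System.Linq;

// Hypothetical helper: true only for a non-empty string made up entirely of letters.
// The IsNullOrEmpty guard is an assumption; All() alone would accept "".
static bool IsAllLetters(string s) =>
    !string.IsNullOrEmpty(s) && s.All(char.IsLetter);

Console.WriteLine(IsAllLetters("Abc\u00E1\u00E9"));  // True: precomposed á and é are letters
Console.WriteLine(IsAllLetters("abc123"));           // False: digits are not letters
Console.WriteLine(IsAllLetters(@"\/<>*()"));         // False: punctuation and symbols
Console.WriteLine(IsAllLetters(""));                 // False: rejected by the guard
Console.WriteLine(IsAllLetters("e\u0301"));          // False: U+0301 is a combining mark
```

The last line shows the limitation raised in the comments: a decomposed "é" ("e" followed by COMBINING ACUTE ACCENT) is rejected, because the combining mark is not itself a letter.
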

Habib
  • Since they're trying to validate stuff using a facility that allows for regex validation I guess a non-regex solution won't really work. I still gave you +1 earlier due to the elegance, albeit it still would fail for combining characters (as did my initial solution). – Joey Feb 28 '13 at 09:40

You can use the Unicode character class that says exactly that (`L` is the general category for letters):

    \p{L}

So the regex

    ^\p{L}+$

will match if the string consists only of letters. If you expect combining characters (category `M`, marks), then

    ^(\p{L}\p{M}*)+$

works.

Quick PowerShell test:

    PS> 'foo','bär','a.b','&^#&%','123','кошка' -match '^\p{L}+$'
    foo
    bär
    кошка
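
The two patterns above can also be exercised from C#. This is a minimal sketch rather than a complete validator. The variable names are illustrative, and it is written with top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Patterns taken from the answer above.
var lettersOnly = new Regex(@"^\p{L}+$");
var lettersWithMarks = new Regex(@"^(\p{L}\p{M}*)+$");

string precomposed = "\u00E9";  // é as a single code point
string decomposed = "e\u0301";  // "e" followed by COMBINING ACUTE ACCENT

Console.WriteLine(lettersOnly.IsMatch(precomposed));      // True
Console.WriteLine(lettersOnly.IsMatch(decomposed));       // False: the mark is not a letter
Console.WriteLine(lettersWithMarks.IsMatch(decomposed));  // True: \p{M}* absorbs the mark
Console.WriteLine(lettersWithMarks.IsMatch("a.b"));       // False
```

One caveat for validation: in .NET, `$` also matches just before a trailing newline, so `\z` is the stricter end anchor if that matters.
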
Joey
  • Note that this allows letters in any language (Chinese, Korean, etc.), not just Latin-based scripts. – nhahtdh Feb 28 '13 at 09:20
  • +1. As additional information, see the [regular-expressions.info page about Unicode Character Properties](http://www.regular-expressions.info/unicode.html#prop) – stema Feb 28 '13 at 09:21
  • nhahtdh: Well, yes, that's what I understand when they say »All kinds of letters should be allowed«. – Joey Feb 28 '13 at 09:22
  • How does this deal with surrogates? I.e. does something like U+0065 U+0301 (= “e” + “COMBINING ACUTE ACCENT” = é) match? (It works in OS X’ `grep`, I’m asking specifically for .NET here.) – Konrad Rudolph Feb 28 '13 at 09:24
  • Konrad, those are not surrogates; they're combining characters. But it fails on those; I'll fix it. – Joey Feb 28 '13 at 09:30
  • I'm actually trying to put this regex in a `RegularExpressionValidator` (ASP.NET control). But it is not validating for me. Not even with normal letters: `ValidationExpression="^\p{L}+$"`. I also tried `^\w+$`, which was suggested by others. That does allow normal characters, but fails when I enter something like `é`. Any idea what the problem could be? – Vivendi Feb 28 '13 at 09:34
  • @Joey Damn, I always mix up surrogates and code units. What I meant was characters consisting of more than one UTF-16 code unit (and before someone objects, there seem to be different usages of the term “code unit” – the example above should make it clear what’s meant). – Konrad Rudolph Feb 28 '13 at 09:35
  • Vivendi: »The regular expression validation syntax is slightly different on the client than on the server. On the client, JScript regular expression syntax is used. On the server, Regex syntax is used. Since JScript regular expression syntax is a subset of Regex syntax, it is recommended that you use JScript regular expression syntax in order to yield the same results on both the client and the server.« – this regex was specifically for .NET and other useful regex engines. You see it fail because the client-side validation uses JavaScript which has vastly inferior regexes. – Joey Feb 28 '13 at 09:36
  • So I guess the possible duplicate suggested by CodeCaster might in fact be for you. Or you'll disable client-side validation (because it either becomes a very messy regex or will yield quite many false positives). – Joey Feb 28 '13 at 09:37
  • Ah sorry, i didn't know it was different. I'll try to lookup how to do this in JavaScript then. Thanks. – Vivendi Feb 28 '13 at 09:38
  • Konrad, in that case you are talking about *graphemes* that consist of more than one code *point* (which, for the BMP, is identical with more than one code *unit*). My fix above is for your example, but I think it will still fail for actual surrogates, i.e. characters outside the BMP which are represented in UTF-16 with two code *units* (but are still one code *point*). The terminology is actually fairly clear here and not that hard to confuse (then again, I've been lurking on the Unicode mailing list for a few years by now) ;-) – Joey Feb 28 '13 at 09:39
  • Hmm, so how does this comply with the Unicode definition for regular expressions then? Because, like I said, `grep [[:alpha:]]` actually matches U+0065 U+0301 on OS X – is that allowed? Is it required? (And I just realised how confused my talk about code units was.) – Konrad Rudolph Feb 28 '13 at 09:53
  • I haven't read the TR on that, I have to admit. The POSIX character classes work differently from those that match Unicode character properties. It could also be that your `grep` converts to a normalisation form and accounts for that. Generally I'd say `\p{L}` must match any code point that is defined as a Letter; after all, that's what it says on the tin. For matching characters followed by combining marks there is `\X`, but that's not widely supported. And I doubt that `\p{L}` is required to act like `\X`. – Joey Feb 28 '13 at 10:12