85

I want to know the regexp for the following case:

The string should contain only alphabetic letters. It must start with a capital letter followed by a lowercase letter. After that, it can contain lowercase or capital letters.

^[A-Z][a-z][A-Za-z]*$

But the string must also not contain any consecutive capital letters. How do I add that logic to the regexp?

That is, HttpHandler is correct, but HTTPHandler is wrong.

Peter Mortensen
kiki

5 Answers

186

Whenever one writes [A-Z] or [a-z], one explicitly commits to processing nothing but 7-bit ASCII data from the 1960s. If that’s really ok, then fine. But if it’s not ok, then Unicode character properties exist to help you with handling modern character data.

There are three cases in Unicode, not two. Furthermore, you also have noncased letters. Letters in general are specified by the \pL property, and each of these also belongs to exactly one of five subcategories:

  1. uppercase letters, specified with \p{Lu}; eg: AÇǱÞΣSSὩΙST
  2. titlecase letters, specified with \p{Lt}; eg: ǈǅSsᾨSt (actually Ss and St are an upper- and then a lowercase letter, but they are what you get if you ask for the titlecase of ß and ﬅ, respectively)
  3. lowercase letters, specified with \p{Ll}; eg: aαçǳςσþßᾡﬅ
  4. modifier letters, specified with \p{Lm}; eg: ʰʲᴴᴭʺˈˠᵠꜞ
  5. other letters, specified with \p{Lo}; eg: ƻאᎯᚦ京

You can take the complement of any of these, but do be careful, because something like \P{Lu} does not mean a letter that isn’t uppercase! It means any character that isn’t an uppercase letter.
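As a rough illustration of the categories and of the complement caveat, here is a minimal Python sketch. It uses the standard `unicodedata` module rather than a regex, since Python's built-in `re` has no `\p{...}` support. The sample characters and the helper name `is_not_uppercase_letter` are my own choices for illustration, not part of the answer.

```python
import unicodedata

# One sample character per letter subcategory (samples chosen for illustration).
samples = [
    "A",        # Lu: uppercase letter
    "\u01c5",   # Lt: titlecase digraph (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)
    "a",        # Ll: lowercase letter
    "\u02b0",   # Lm: MODIFIER LETTER SMALL H
    "\u4eac",   # Lo: a CJK ideograph, which has no case at all
]
for ch in samples:
    print(repr(ch), unicodedata.category(ch))


def is_not_uppercase_letter(ch):
    """Rough analogue of \\P{Lu}: true for ANY character that is not in Lu."""
    return unicodedata.category(ch) != "Lu"


# The complement is not limited to letters: digits and punctuation qualify too.
for ch in ["a", "1", "_", "A"]:
    print(repr(ch), is_not_uppercase_letter(ch))
```

The second loop shows why `\P{Lu}` is not "a non-uppercase letter": the digit and the underscore pass the test just as the lowercase letter does.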

For a letter that’s either uppercase or titlecase, use [\p{Lu}\p{Lt}]. So for your pattern you could use:

 ^([\p{Lu}\p{Lt}]\p{Ll}+)+$

If you don’t mean to limit the letters following the first to the “casing” letters alone, then you might prefer:

 ^([\p{Lu}\p{Lt}][\p{Ll}\p{Lm}\p{Lo}]+)+$
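In an engine without `\p{...}` (Python's built-in `re`, for instance), the two patterns above can be approximated by first translating each character into a one-letter code for its general category and then matching the code string. This is only a sketch under that assumption; the function names `category_codes`, `matches_strict` and `matches_relaxed` are hypothetical, and an engine with native property support would use the patterns directly.

```python
import re
import unicodedata


def category_codes(text):
    """Translate each character into a one-letter code for its category."""
    codes = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat in ("Lu", "Lt"):
            codes.append("U")   # uppercase or titlecase letter
        elif cat == "Ll":
            codes.append("l")   # lowercase letter
        elif cat in ("Lm", "Lo"):
            codes.append("o")   # modifier or other (noncased) letter
        else:
            codes.append("x")   # anything that is not a letter
    return "".join(codes)


def matches_strict(text):
    # Emulates ^([\p{Lu}\p{Lt}]\p{Ll}+)+$
    return re.fullmatch(r"(Ul+)+", category_codes(text)) is not None


def matches_relaxed(text):
    # Emulates ^([\p{Lu}\p{Lt}][\p{Ll}\p{Lm}\p{Lo}]+)+$
    return re.fullmatch(r"(U[lo]+)+", category_codes(text)) is not None


for word in ["HttpHandler", "HTTPHandler", "\u00c9lan", "A\u02b0"]:
    print(repr(word), matches_strict(word), matches_relaxed(word))
```

Both emulations reject HTTPHandler, and only the relaxed one accepts an uppercase letter followed by a modifier letter.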

If you’re trying to match so-called “CamelCase” identifiers, then the actual rules depend on the programming language, but usually include the underscore character and the decimal numbers (\p{Nd}), and may also include a literal dollar sign and other language-dependent characters. If so, you may wish to add some of these to one or the other of the two character classes provided above.

For example, you may wish to add underscore to both but digits only to the second, leaving you with:

 ^([_\p{Lu}\p{Lt}][_\p{Nd}\p{Ll}\p{Lm}\p{Lo}]+)+$

If, though, you are dealing with certain “words” from various RFCs and ISO standards, these are often specified as containing ASCII only. If so, you can get by with the literal [A-Z] idea. It’s just not kind to impose that restriction if it doesn’t actually exist.

tchrist
  • Totally agreed concerning the restriction. Here's a bit more info about this: [Regular-expressions.info](http://www.regular-expressions.info/unicode.html). – Daneo Dec 26 '12 at 10:29
  • In case you want to use `re` in Python, you have to know that it doesn't support Unicode character properties. http://pypi.python.org/pypi/regex does. – noisy Jun 01 '13 at 13:06
  • Hold on a second, there are people that **don't** use perl for regexen? – hd1 Jul 29 '13 at 18:07
  • your Ὡ isn't recognized by perl as Capital letter, use this omega -- Ω :-) :-P – Bogdan Mart Jun 26 '19 at 19:02
  • @BogdanMart Thanks, apparently I had a copy-paste error. U+1FA8 ‭ ᾨ `GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI` is a titlecase codepoint, so I've put that there. Ὡ and Ω are both uppercase, not titlecase. – tchrist Jun 26 '19 at 19:30
  • I haven't been able to figure out how to get your examples to work. https://regex101.com/r/yYExY4/1 Do these work in any of the languages in regex101? What am I misunderstanding? Thanks. – Ryan Sep 13 '19 at 21:54
  • @Ryan your example works perfectly well, you just have a trailing whitespace in the line that should match. And since $ matches only the end of the line, trailing whitespaces in the line will not be accepted (which is correct, since an identifier containing a space is not valid) – Falco Sep 14 '20 at 08:52
  • @Falco Thanks. Multiple lines in my link above should have matched though (at least for my goals). This version 5 of that fiddle works better for me: https://regex101.com/r/yYExY4/5 – Ryan Sep 14 '20 at 12:49
  • @Ryan it really depends on your requirements. Your regex will only allow "uppercase A-Z and digits 0-9", and the first character cannot be a digit. But if your requirements are, for example, "any uppercase letter (any language) and any digit (any language, not only 0-9)", then you could e.g. use the following: `[\p{Lu}\p{Lt}][\p{Lu}\p{Nd}]+` – Falco Sep 14 '20 at 13:15
  • @Falco Oh ok, great! Thank you! I updated https://regex101.com/r/yYExY4/6 – Ryan Sep 14 '20 at 13:17
  • Make sure to add the `u` flag to your RegExp in JavaScript to enable support for matching with `\p{...}` – derpedy-doo Mar 06 '22 at 13:40
51

Take a look at tchrist's answer, especially if you develop for the web or something more "international".

Oren Trutner's answer isn't quite right (see sample input of "RightHerE" which must be matched, but isn't).

Here is the correct solution:

(?!^.*[A-Z]{2,}.*$)^[A-Za-z]*$

Explained:

(?!^.*[A-Z]{2,}.*$)  // don't match the whole expression if there are two or more consecutive uppercase letters
^[A-Za-z]*$          // match uppercase and lowercase letters

/edit

The key for the solution is a negative lookahead. See: Lookahead and Lookbehind Zero-Length Assertions
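To see what the lookahead version accepts, here is a small Python sketch (Python's `re` supports negative lookahead). The test words are my own. The second pattern, `combined`, is not from this answer: it is an assumed variation that keeps the lookahead but reuses the question's `^[A-Z][a-z][A-Za-z]*$` so that the required start is enforced as well.

```python
import re

# The pattern from this answer, unchanged.
no_double_caps = re.compile(r"(?!^.*[A-Z]{2,}.*$)^[A-Za-z]*$")

# Assumed variation: same lookahead, plus the question's start rule
# (one capital letter, then one lowercase letter).
combined = re.compile(r"(?!^.*[A-Z]{2,}.*$)^[A-Z][a-z][A-Za-z]*$")

for word in ["HttpHandler", "RightHerE", "HTTPHandler", "httpHandler", ""]:
    print(
        repr(word),
        bool(no_double_caps.match(word)),
        bool(combined.match(word)),
    )
```

As written, the answer's pattern also accepts a word starting with a lowercase letter and the empty string, because `^[A-Za-z]*$` does not constrain the start. The combined form rejects both while still accepting RightHerE.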

Peter Mortensen
Stephan Schinkel
  • What do ?, !, ., etc. stand for? – kiki Nov 02 '10 at 10:11
  • It's a negative lookahead; see my posted link for an in-depth explanation. Basically it says that if the regex inside the negative lookahead matches, the whole expression does not match. So you can, for example, say ^[0-9]$ (match one digit from 0 to 9), and you can say (?!^3$)^[0-9]$ (match one digit from 0 to 9 except 3). – Stephan Schinkel Nov 02 '10 at 14:05
11
^([A-Z][a-z]+)+$

This looks for sequences of an uppercase letter followed by one or more lowercase letters. Consecutive uppercase letters will not match, as only one is allowed at a time, and it must be followed by a lowercase one.
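A quick way to check this pattern against the sample words mentioned on this page is the following Python sketch (the variable name `cap_then_lower` is mine):

```python
import re

# The pattern from this answer, unchanged.
cap_then_lower = re.compile(r"^([A-Z][a-z]+)+$")

for word in ["HttpHandler", "HTTPHandler", "RightHerE", "TestX"]:
    print(repr(word), bool(cap_then_lower.match(word)))
```

HttpHandler matches and HTTPHandler does not, but RightHerE and TestX are rejected as well, because every uppercase letter, including a final one, must be followed by at least one lowercase letter.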

Oren Trutner
  • Pls excuse my ignorance. See, this is my regexp as of now: (^[A-Z][a-z][A-Za-z]*$)|(^I[A-Z][a-z][A-Za-z]*$). Into this, I have to add the logic to check that capital letters do not come together in the [A-Za-z] portion. What would you suggest? And what does + mean exactly? – kiki Oct 29 '10 at 10:52
  • This won't match the valid TestX since you will not match the final uppercase letter – Falco Jan 20 '15 at 15:25
8

Aside from tchrist's excellent post concerning Unicode, I think you don't need the complex solution with a negative lookahead... Your definition requires an uppercase letter followed by at least one group of (a lowercase letter optionally followed by an uppercase letter):

^
[A-Z]    // Start with an uppercase Letter
(        // A Group of:
  [a-z]  // mandatory lowercase letter
  [A-Z]? // an optional Uppercase Letter at the end
         // or in between lowercase letters
)+       // This group at least one time
$

It is just a bit more compact and easier to read, I think...
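The commented layout above maps closely onto Python's `re.VERBOSE` mode, where whitespace is ignored and `#` starts a comment. The following is a sketch under that assumption; the `//` comments are rewritten as `#`, the group is made non-capturing, and the test words are my own.

```python
import re

upper_then_groups = re.compile(
    r"""
    ^
    [A-Z]        # start with an uppercase letter
    (?:          # a group of:
      [a-z]      #   a mandatory lowercase letter
      [A-Z]?     #   optionally followed by one uppercase letter
    )+           # this group at least one time
    $
    """,
    re.VERBOSE,
)

for word in ["HttpHandler", "HTTPHandler", "RightHerE", "TestX", "A"]:
    print(repr(word), bool(upper_then_groups.match(word)))
```

Because every uppercase letter inside the group must be preceded by a lowercase one, two capitals can never be adjacent, while a trailing capital as in TestX is still accepted.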

Peter Mortensen
Falco
-20

If you want to get all employee names in MySQL which have at least one uppercase letter then apply this query:

SELECT * FROM registration WHERE `name` REGEXP BINARY '[A-Z]';
Peter Mortensen
Gaurav Kumar