5

I have a regex that right now only allows lowercase letters, I need one that requires either lowercase or uppercase letters:

/(?=.*[a-z])/
Amen
  • 2
    Do you realize that this regex is equivalent to `/[a-z]/` and matches every string that contains at least one lower-case letter? –  Feb 02 '11 at 15:05
  • Also, what is the use of the parentheses if you’re going to discard (`?=`) the capture anyway? – Martijn Feb 02 '11 at 15:08
  • @Martijn: Parens are required for lookahead (and many other things) and don't actually group. –  Feb 02 '11 at 15:12
  • 1
    @delnan: `[a-z]` fails to match 1723 lowercase letters, and fails to match 1882 lowercase code points overall. Basically, very very very nearly every time you see somebody writing a-z, they’ve screwed up. – tchrist Feb 02 '11 at 18:23
  • @tchrist: Yes, most regexes written are unicode-ignorant. (But if you're just parsing some ASCII-only stuff, sticking to ASCII is valid!) –  Feb 02 '11 at 18:27
  • @delnan: It is an exceedingly poor habit to be stuck in a 1960s data-processing mode. Unicode is backwards compatible with ASCII, but ASCII is not forwards compatible with Unicode. Fifty-year-old ASCII is more than twenty years out of date. The overwhelming majority of the web is Unicode these days you know. Develop good habits **now** so you don’t dinosaur yourself or your code. – tchrist Feb 02 '11 at 18:30
  • 2
    @tchrist: Not exactly sure what you're trying to tell me. Yes, about every software that will ever be outside the U.S. should be aware of unicode and probably use it. (And my code is, as far as I can tell.) However, if the spec says "ascii lowercase letters", that's what the regex has to match - so no, `[a-z]` doesn't have to be "you fail i18n forever". –  Feb 02 '11 at 18:34
  • @delnan: It is a misunderstanding to think that one can write even American English properly using ASCII alone; one cannot. Sure, if I’m conforming to an RFC I will specify the precise code points which that RFC demands. Otherwise I do not, because I understand that `[a-z]` is simply wrong. – tchrist Feb 02 '11 at 18:37
  • @tchrist: Can you elaborate on that point about not being able to write American English properly? (And while I'm at it, can we agree that you can write American English words, i.e., if they're clearly not imported from other languages? Because that's the point I'm specifically trying to make here, and I feel the OP also similarly doesn't care about that subtlety) – Platinum Azure Feb 03 '11 at 17:24
  • 3
    @Platinum The only English words one wouldn’t really consider “from another language” would (debatably) be those from Old English. Regular words like *naïve* & *façade* (quite unlike *waive* and *arcade*) and names like *Zoë* & *Renée* are all pretty firmly established, with others like *El Niño* & *Cañon City* probably here to stay. *Ægypt, archæology, œnology, rôle, pæan, learnèd, coördinate, résumé, reëlect, première,* and *décor* were all once standard — **and sometimes still are.** Then consider punctuation: *O’Reilly, Hawai‘i, “Iowa–Wisconsin”, un‐American*. **See the big problem now?** – tchrist Feb 03 '11 at 18:27

4 Answers

22

You Can’t Get There from Here

I have a regex that right now only allows lowercase letters, I need one that requires either lowercase or uppercase letters: /(?=.*[a-z])/

Unfortunately, it is utterly impossible to do this correctly using Javascript! Read this flavor comparison’s ECMA column for all of what Javascript cannot do.

Theory vs Practice

The proper pattern for lowercase is the standard Unicode derived binary property \p{Lowercase}, and the proper pattern for uppercase is similarly \p{Uppercase}. These are normative properties that sometimes include non-letters in them under certain exotic circumstances.

Using just General Category properties, you can have \p{Ll} for Lowercase_Letter, \p{Lu} for Uppercase_Letter, and \p{Lt} for Titlecase_Letter. (Remember, there are three cases in Unicode, not two.) There is a standard alias \p{LC} which means [\p{Lu}\p{Lt}\p{Ll}].

If you want a letter that is not a lowercase letter, you could use (?=\P{Ll})\pL. Written in longhand that’s (?=\P{Lowercase_Letter})\p{Letter}. Again, these miss some of the Other_Lowercase code points that \p{Lowercase} recognizes. I must again stress that the Lowercase property is a superset of the Lowercase_Letter property.

Reread the previous paragraph, swapping in upper everywhere I have written lower, and you get the same thing for the capitals.
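
For illustration only: on a Javascript engine that does implement these properties, the patterns above can be tried out as follows. This is a sketch, not a portable recipe. It assumes Unicode property escapes are available (they entered the standard with ECMAScript 2018, well after this answer was written, and they need the `u` flag). The constant and function names are made up for the example. The shorthand \pL is not accepted in that syntax, so it is written \p{L} here.

```javascript
// Assumption: the engine supports \p{...} property escapes (requires the `u` flag).
const lowerLetter = /\p{Ll}/u;          // General Category Lowercase_Letter
const upperLetter = /\p{Lu}/u;          // General Category Uppercase_Letter
const titleLetter = /\p{Lt}/u;          // General Category Titlecase_Letter
const casedLetter = /\p{LC}/u;          // alias for [\p{Lu}\p{Lt}\p{Ll}]
const lowercaseProp = /\p{Lowercase}/u; // derived binary property, a superset of Ll

// A letter that is not a lowercase letter: (?=\P{Ll})\pL in the answer's notation.
const nonLowerLetter = /(?=\P{Ll})\p{L}/u;

// Hypothetical helper: does the string contain at least one lowercase
// or uppercase letter, in any script?
function hasLowerOrUpper(s) {
  return /[\p{Ll}\p{Lu}]/u.test(s);
}

// U+00E9 (e with acute) is a lowercase letter that [a-z] cannot see.
console.log(lowerLetter.test("\u00E9"), /[a-z]/.test("\u00E9"));

// U+24D0 (circled small a) is not in Ll but does have the Lowercase property.
console.log(lowerLetter.test("\u24D0"), lowercaseProp.test("\u24D0"));
```

The last two lines show the gap between [a-z], \p{Ll}, and \p{Lowercase} that the paragraphs above describe.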

Possible Platforms

Because access to these essential properties is the minimal level of critical functionality necessary for Unicode regular expressions, some versions of Javascript implement them in just the way I have written them above. However, the standard for Javascript still does not require them, so you cannot in general count on them. This means that it is impossible to do this correctly under all implementations of Javascript.

Languages in which it is possible to do what you want done minimally include:

  • C♯ and Java (both only General Categories)
  • Ruby if and only if v1.9 or better (only binary properties, including General Categories)
  • PHP and PCRE (only General Category and Script properties plus a couple extras)
  • ICU’s C++ library and Perl, which both support all Unicode properties

Of those listed above, only the last line’s entries, ICU and Perl, strictly and completely meet all Level 1 compliance requirements (plus some of Levels 2 and 3) for the proper handling of Unicode in regexes. However, all of those I’ve listed in the bullets above can easily handle most, and quite probably all, of what you need.

Javascript is not amongst those, however. Your version might, though, if you are very lucky and never have to run on a standard-only Javascript platform.

Summary

So very sadly, you cannot really use Javascript regexes for Unicode work unless you have a non-standard extension. Some people do, but most do not. If you do not, you may have to use a different platform until the relevant ECMA standard catches up with the 21st century (Unicode 3.1 came out a decade ago!!).

If anyone knows of a Javascript library that implements the Level 1 requirements of UTS#18 on Unicode Regular Expressions including both RL1.2 “Properties” and RL1.2a “Annex C: Compatibility Properties”, please chime in.

tchrist
  • 1
    Should have known you'd be writing the treatise. :-) My only reply is that there's no indication in the question that the OP cares about Unicode or i18n. :-) – Platinum Azure Feb 02 '11 at 20:47
  • 1
    i18n is irrelevant. Assume all 7-bit hardcodings are wrong by default unless proven otherwise. Welcome to the last 20 years! – tchrist Feb 02 '11 at 21:52
  • 1
    Great answer :) Just came across this while researching the topic. This should really be the accepted answer. – Niklas B. Apr 21 '12 at 15:50
  • What would you say about the [XRegExp](http://xregexp.com/plugins/#unicode) lib? At least it can tell between lowercase and uppercase letters. – Antony Hatchkins Jun 21 '13 at 17:36
16

Not sure if you mean mixed-case, or strictly lowercase plus strictly uppercase.

Here's the mixed-case version:

/^[a-zA-Z]+$/

And the strictly one-or-the-other version:

/^([a-z]+|[A-Z]+)$/
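
To make the difference between the two concrete, here is a small sketch that runs both patterns over the same input. The `classify` helper and the constant names are hypothetical, added only for illustration. Both patterns are ASCII-only, as written above.

```javascript
// Mixed-case version: every character must be an ASCII letter, in either case.
const mixedCase = /^[a-zA-Z]+$/;

// One-or-the-other version: the whole string is all lowercase or all uppercase.
const singleCase = /^([a-z]+|[A-Z]+)$/;

// Hypothetical helper that reports which of the two patterns accept a string.
function classify(s) {
  return { mixed: mixedCase.test(s), single: singleCase.test(s) };
}

for (const sample of ["hello", "HELLO", "Hello", "hello1"]) {
  console.log(sample, classify(sample));
}
```

A string such as "Hello" passes the mixed-case pattern but fails the one-or-the-other pattern, while anything with a digit or other non-letter fails both.
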
Platinum Azure
8

Try /(?=.*[a-z])/i

Note the i at the end; it makes the expression case-insensitive.
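
A quick sketch of what the flag changes, with names invented for the example. As a comment on the question points out, the lookahead only checks that a letter occurs somewhere, so for a plain test() it behaves like /[a-z]/i. It does not require the whole string to be letters.

```javascript
const original = /(?=.*[a-z])/;     // the pattern from the question
const insensitive = /(?=.*[a-z])/i; // the same pattern with the i flag
const plain = /[a-z]/i;             // simpler form that test() treats the same way

// Hypothetical helper: true if the string contains at least one ASCII letter.
function containsAsciiLetter(s) {
  return insensitive.test(s);
}

for (const sample of ["abc", "ABC", "123", "123x"]) {
  console.log(sample, original.test(sample), containsAsciiLetter(sample), plain.test(sample));
}
```
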

Leigh
2

Or add an uppercase range to your regex:

/(?=.*[a-zA-Z])/
karim79
  • 1
    That is not the way one matches uppercase letters. Or lowercase ones, for that matter. It’s a horribly 1960s approach. It’s at least 20 years out of date, and has no place in modern text processing. – tchrist Feb 02 '11 at 18:26
  • 2
    @tchrist - nice, 1960s, 20 years, great. Explanation please? Kindly help me improve my regexes. – karim79 Feb 03 '11 at 00:11
  • 7-bit ASCII appeared in the 60s. Unicode is now 20 years old, and the ISO 8859 codes are older still. ASCII is too old-school to serve the world today. The web is *not* ASCII! – tchrist Feb 03 '11 at 00:20
  • 2
    I don't know about the OP's use case but a lot of my JavaScript runs on Wiktionary data and in that world we have upper and lower case in the Armenian, Cyrillic, and Greek alphabets, as well as quite a lot more than 26 Latin letters too. – hippietrail Nov 23 '12 at 12:56