14

Is the regular expression [a-Z] valid and if yes then is it the same as [a-zA-Z]? Please note that in [a-Z] the a is lowercase and the Z is uppercase.

Edit:

I received some answers specifying that [a-Z] is not valid, whereas [A-z] is valid (but is not the same as [a-zA-Z]). That is really what I was looking for, since I wanted to know in general whether [a-zA-Z] can be replaced with a more compact version.

Thanks to all who contributed to the answer.

Peter Mortensen
Karim

7 Answers

35

No. a (97) has a higher character code than Z (90), so [a-Z] runs backwards and isn't a valid character class. [A-z] wouldn't be equivalent to [a-zA-Z] either, but for a different reason: it would cover all the letters, but it would also include the six characters that sit between the uppercase and lowercase letters: [\]^_`.
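As a quick check of the ordering argument above, here is a minimal sketch in Python (my choice of language; the question doesn't name one). The helper `is_valid_pattern` is a hypothetical name introduced only for this illustration, and other regex engines may report the error differently.

```python
import re

# Uppercase letters come before lowercase letters in ASCII/Unicode.
print(ord("Z"), ord("a"))  # 90 97


def is_valid_pattern(pattern: str) -> bool:
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(is_valid_pattern("[a-Z]"))  # False: the range runs backwards
print(is_valid_pattern("[A-z]"))  # True: it compiles, but it is too broad

# The underscore sits between Z and a, so [A-z] accepts it.
print(re.fullmatch("[A-z]", "_") is not None)  # True
```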

John Kugelman
  • Yes it is... `[a-Z]` is invalid because `Z` comes before `a` – gnarf Nov 02 '09 at 00:14
  • 3
    I explained why both `[a-Z]` and `[A-z]` are invalid. Don't downvote me for doing extra credit. :-) – John Kugelman Nov 02 '09 at 00:19
  • 1
    I am unsure whether regexes are only specified for ASCII. Couldn't this also be dependent on the encoding and collation? – Svante Nov 02 '09 at 07:15
  • [a-Z] is invalid in the C locale, yes. In that locale, the numeric value of the encoded character is the order. But that does not apply to many other locales (for example en_US.utf8). In that locale, [a-Z] represents an existing collation order and therefore is valid. Furthermore, it represents all the upper and lower letters in the ASCII range. –  May 16 '18 at 18:08
  • Easily the best answer here with reference to the op's question. Perhaps it is also worth adding in suggestions about making the regular expression case insensitive to aid readability for anything more complex than these simple examples, if possible in the library/language variant in use (e.g. `/[a-z]/i` or `(?i)[a-z]`) – David Long Jul 12 '18 at 13:55
4

I'm not sure about other languages' implementations, but in PHP you can do

"/[a-z]/i"

and it will be case-insensitive. There is probably something similar in other languages.
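For comparison, here is a small sketch of the same idea in Python, assuming the standard `re` module. The variable names are just for illustration.

```python
import re

# Two equivalent ways to request case-insensitive matching in Python.
with_flag = re.compile(r"[a-z]+", re.IGNORECASE)
with_inline = re.compile(r"(?i)[a-z]+")

print(with_flag.fullmatch("MonKey") is not None)    # True
print(with_inline.fullmatch("MonKey") is not None)  # True

# Punctuation such as the underscore is still rejected.
print(with_flag.fullmatch("mon_key") is not None)   # False
```

According to the Python documentation, with `str` patterns the case-insensitive `[a-z]` also matches a handful of non-ASCII letters; adding `re.ASCII` to the flags should restrict it to the 52 ASCII letters if that matters for your input.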

Peter Mortensen
helloandre
  • Most of PHP's features come from Perl, including this one. (PHP used to be written in Perl. Actually one of the P's used to stand for Perl) – Brad Gilbert Nov 02 '09 at 01:33
3

You don't specify what language, but in general [a-Z] won't be a valid range, as in ASCII the lower-case alpha characters come after the upper-case ones. [A-z] might be a valid range (indicating all upper- and lower-cased alphas as well as the punctuation that appears between Z and a), but it might not be, depending on your particular implementation. The i flag can be added to the regex to make it case-insensitive; check your particular implementation for instructions on how to specify that flag.

Ether
2

I've just fallen over this in a script (not my own).

It seems that grep, awk and sed accept [a-Z] depending on your locale (i.e. the LANG or LC_CTYPE environment variable). In the POSIX locale, these tools don't allow [a-Z], but in some other locales (e.g. en_GB.utf8) it works and is the same as [a-zA-Z].

Yes, I've checked, it doesn't match any of _^[]`.

Given that this has taken quite some time to debug, I strongly discourage anyone from ever using [a-Z] in a regex.

the Tin Man
2

You could always try it:

 print "ok" if "monkey" =~ /[a-Z]/;

Perl says

Invalid [] range "a-Z" in regex; marked by <-- HERE in m/[a-Z <-- HERE ]/ at a-z.pl line 4.
Jeff Atwood
  • 2
    Exactly what I said. My favorite saying is "try it 'n c" because if you happen to be developing in C at the time it has two meanings. – Robert Massaioli Nov 02 '09 at 00:07
  • 3
    I don't like "try it and see" because if he had tried `[A-z]` there'd be no error message but it wouldn't work right either. – John Kugelman Nov 02 '09 at 00:09
  • This is because in ASCII, uppercase comes first. So, [A-z] is valid, but [a-Z] is not. – jheddings Nov 02 '09 at 00:09
  • But he's not asking that question. The question is very clear. Why are you deliberately misinterpreting it? –  Nov 02 '09 at 00:18
2

If it's valid, it won't do what you expect.

The character code of Z is lower than the character code of a, so if the codes are swapped to mean the range [Z-a], it will be the same as [Z\[\\\]^_`a], i.e. it will include the characters Z and a, and the characters between.

If you use [A-z] to get all upper and lower case characters, that is still not the same as [A-Za-z], it's the same as [A-Z\[\\\]^_`a-z].
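To see exactly which characters [A-z] adds, here is a rough sketch in Python (an assumption on my part, since the question is language-agnostic) that lists every ASCII character accepted by `[A-z]` but rejected by `[A-Za-z]`.

```python
import re

broad = re.compile(r"[A-z]")
strict = re.compile(r"[A-Za-z]")

# Every ASCII character that the broad class accepts but the strict one rejects.
extras = [
    chr(code)
    for code in range(128)
    if broad.fullmatch(chr(code)) and not strict.fullmatch(chr(code))
]

print(extras)  # ['[', '\\', ']', '^', '_', '`']
```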

Guffa
1

No, it's not valid, probably because in ASCII Z (90) comes before a (97), so a range from a to Z runs backwards.

Peter Mortensen
ennuikiller