53

I'm struggling with the following regexp

[A-z0-9]+

If tested against this string:

||a919238[.--a]asd|

it returns a919238[, including the square bracket. I entered my test case on regex101 to understand what's wrong, but the site's explanation of the regex is not helping; I'm probably just not seeing my mistake.

Why is the square bracket included in the result?

BeNdErR
  • 29
    Look at the [ASCII table](http://www.ascii-code.com/) - which characters are between A and z? – georg Feb 11 '15 at 08:46
  • 1
    @georg square bracket! thanks – BeNdErR Feb 11 '15 at 08:49
  • 8
    Hmm, I never used A-z before, but I guess that's good because of this! I've always used A-Za-z0-9 to be explicitly clear on the ranges. – Nelson Feb 11 '15 at 13:16
  • 4
    Better than `[a-Z]` :-) – Bergi Feb 11 '15 at 19:42
  • @Jonny5 Yep, I agree. So mine is also a duplicate of that question. I'm ready to close my question (the one I referred to) as a duplicate, but why should I reopen this one? I didn't say that the question I referred to is the canonical question or that I'm the first to post an answer for this type of question. – Avinash Raj Feb 12 '15 at 13:28

3 Answers

116

Because

[A-z0-9]+ 
 ↑ ↑ 

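is a range from A to z. In the ASCII table, six non-letter characters sit between Z (90) and a (97): [ \ ] ^ _ and the backtick. So [ is inside the range and gets matched too.

[Image: ASCII table with the relevant characters marked]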
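The range can also be checked without a table. The following is a minimal sketch that walks the code points from A to z and collects everything that is not an ASCII letter, then runs both patterns on the string from the question. The variable names `extras` and `input` are only illustrative.

```javascript
// Collect every character that A-z covers but A-Za-z does not.
const extras = [];
for (let code = "A".charCodeAt(0); code <= "z".charCodeAt(0); code++) {
  const ch = String.fromCharCode(code);
  if (!/[A-Za-z]/.test(ch)) {
    extras.push(ch);
  }
}
console.log(extras.join(" ")); // [ \ ] ^ _ `

// The string from the question: the first match stops one character later
// with A-z because "[" is inside that range.
const input = "||a919238[.--a]asd|";
console.log(input.match(/[A-z0-9]+/)[0]);    // a919238[
console.log(input.match(/[A-Za-z0-9]+/)[0]); // a919238
```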

Maroun
  • Did you use any tool? How did you mark the squares? – Grijesh Chauhan Feb 11 '15 at 15:25
  • 1
    Huh. The 41/101 - 61/141 symmetry is nice, but this question is a good example of why ASCII having `A-Z`,`a-z` next to each other would be pretty nice. We'd also then have symbols together, so could match e.g. `[@-~]`. Any idea why it was laid out the way it is? – OJFord Feb 11 '15 at 16:04
  • 7
    This is the reason why you frequently see `[A-Za-z]` instead of `[A-z]`: to exclude all those extra characters. – Brian J Feb 11 '15 at 16:16
  • 1
    @OllieFord IMO it is this way because the earliest computers were scientific equipment and scientists wouldn't have been particularly bothered by it. Someone just defined it as such and it stayed this way. – Nelson Feb 11 '15 at 17:45
  • 8
    @OllieFord you can switch between upper- and lowercase with the fifth bit alone. – Quentin Feb 11 '15 at 19:21
  • @Quentin That's due to the 'symmetry' I mentioned, but is that really so useful that that's why? I suppose adding 26 is slightly more expensive, but it's surely not that common? – OJFord Feb 11 '15 at 19:33
  • @OllieFord that's the only thing I thought about, but I can't find any other argument. I don't know further. – Quentin Feb 11 '15 at 22:33
  • I think the reason is historical, but it's really interesting. @GrijeshChauhan it's just a simple GNOME editor :) – Maroun Feb 12 '15 at 11:52
  • @OllieFord The alphabet just has 5 letters too few or 10 too many, and it's hard to change our alphabet for computer scientists' use only :-D – NeronLeVelu Feb 13 '15 at 14:26
  • @NeronLeVelu I'm not sure what you mean - I'm only suggesting the ordering in ASCII numbering didn't have to split upper and lower cases, not that the alphabet should be changed! – OJFord Feb 13 '15 at 14:29
  • @OllieFord The reason is that there are 26 letters in the alphabet, not a power of 2 minus 1 such as 31 or 15, so they filled the gap with other elements. – NeronLeVelu Feb 13 '15 at 15:01
  • @NeronLeVelu But the only benefit of that is changing case with a single bit, as Quentin said, right? – OJFord Feb 13 '15 at 15:04
  • @OllieFord Right, they took the first available slot that can contain the whole alphabet and used the next one for upper case, to give some logic (lower/upper) to a non-logical group of elements (the alphabet is conventional) in a power-of-2 world. – NeronLeVelu Feb 16 '15 at 06:35
  • @OllieFord The reason for this is that originally only capital letters were available, then the special chars `[, \, ], ^, _` were added. Later, when `[a-z]` came along, those letters were added after all of that so as not to break the standards. If the table were built nowadays, `[a-z]` would likely come first, then `[A-Z]` etc... – Déjà vu Mar 11 '15 at 01:08
  • @ring0 Source? [RFC 20](https://tools.ietf.org/html/rfc20) 'ASCII Format for Network Interchange' includes both upper- and lower-case. The 5th-bit-flip seems most convincing, but it would be interesting if anyone has something that explicitly says this (or otherwise). – OJFord Mar 11 '15 at 02:03
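The single-bit case switch mentioned in the comments above can be sketched as follows. This only illustrates the bit layout and says nothing about why ASCII was designed that way. `toggleAsciiCase` is a hypothetical helper name, and the bit in question is the one with value 0x20 (bit 5 when counting from zero).

```javascript
// Toggle the case of a single ASCII letter by flipping the bit with value 0x20.
function toggleAsciiCase(ch) {
  if (!/^[A-Za-z]$/.test(ch)) {
    return ch; // leave anything that is not a single ASCII letter untouched
  }
  return String.fromCharCode(ch.charCodeAt(0) ^ 0x20);
}

console.log(toggleAsciiCase("A")); // a
console.log(toggleAsciiCase("z")); // Z
console.log(toggleAsciiCase("[")); // [ (unchanged, not a letter)

// Upper- and lowercase letters are exactly 32 (0x20) code points apart.
console.log("a".charCodeAt(0) - "A".charCodeAt(0)); // 32
```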
19
A ===> 65
z ===> 122
[ ===> 91

So [ falls inside the range you have defined. Use [A-Za-z0-9]+ instead.
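These code points are easy to confirm in a JavaScript console. A small sketch, where `codeOf` and `inRange` are illustrative names only:

```javascript
// Look up the code point of a single character.
const codeOf = (ch) => ch.charCodeAt(0);

console.log(codeOf("A"), codeOf("["), codeOf("z")); // 65 91 122

// "[" lies numerically between "A" and "z", so the class [A-z] accepts it.
const inRange = codeOf("A") <= codeOf("[") && codeOf("[") <= codeOf("z");
console.log(inRange);               // true
console.log(/[A-z]/.test("["));     // true
console.log(/[A-Za-z]/.test("[")); // false
```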

vks
8

You can use /[a-z0-9]+/i (the i makes it case-insensitive), or /[A-Za-z0-9]+/.
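As a quick check, here is a sketch that runs both suggested patterns over the string from the question. The `g` flag is added only to collect every match, and the variable names are illustrative.

```javascript
const sample = "||a919238[.--a]asd|";

// Case-insensitive lowercase range versus explicit upper and lower ranges.
const withFlag = sample.match(/[a-z0-9]+/gi);
const explicit = sample.match(/[A-Za-z0-9]+/g);

console.log(withFlag); // ["a919238", "a", "asd"]
console.log(explicit); // ["a919238", "a", "asd"]

// The i flag lets the lowercase range accept uppercase letters as well,
// while still rejecting the bracket.
console.log("ABC123[".match(/[a-z0-9]+/i)[0]); // ABC123
```

For plain ASCII input like this the two forms give the same matches, so the choice is mostly a matter of readability.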

doppelgreener
Ahosan Karim Asik