[A-z0-9]+ regexp matching square brackets

Question

I'm struggling with the following regexp

[A-z0-9]+

If tested against this string:

||a919238[.--a]asd|

it returns a919238[, including the square bracket. I tried my test case on regex101 to understand what's wrong, but the site's regex explanation is not helping; I'm probably just not seeing my mistake.

Why is the square bracket included in the result?

Look at the [ASCII table](http://www.ascii-code.com/) - which characters are between A and z? — georg, Feb 11 '15 at 08:46
Hmm, I never used A-z before, but I guess that's good because of this! I've always used A-Za-z0-9 to be explicitly clear on the ranges. — Nelson, Feb 11 '15 at 13:16
@Jonny5 Yep, I agree. So mine is also a duplicate of that question. I'm ready to close my question (the one I referred to) as a duplicate, but why should I reopen this? I didn't say that the question I referred to is the canonical question, or that I'm the first to post an answer for this type of question. — Avinash Raj, Feb 12 '15 at 13:28

score 116 · Accepted Answer · answered Feb 11 '15 at 08:46

Because

[A-z0-9]+ 
 ↑ ↑

is a range from A to z. In the ASCII table, [ and ] (along with \, ^, _ and `) sit between Z and a, so the range includes them:

[image: ASCII table]
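
As a quick check of that range, here is a minimal sketch in Python. The choice of Python's `re` module is an assumption, since the question does not name a regex flavor. It lists the non-letter characters that fall between `A` and `z` and confirms the original character class accepts each one:

```python
import re

# Every code point from 'A' (65) to 'z' (122) that is not a letter.
extras = [chr(c) for c in range(ord("A"), ord("z") + 1) if not chr(c).isalpha()]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# Each of those characters is accepted by the original character class.
matched = [ch for ch in extras if re.fullmatch(r"[A-z0-9]", ch)]
print(matched == extras)  # True
```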

answered Feb 11 '15 at 08:46

Maroun


Did you use any tool? How did you mark the squares? – Grijesh Chauhan Feb 11 '15 at 15:25
Huh. The 41/101 - 61/141 symmetry is nice, but this question is a good example of why ASCII having `A-Z`,`a-z` next to each other would be pretty nice. We'd also then have symbols together, so could match e.g. `[@-~]`. Any idea why it was laid out the way it is? – OJFord Feb 11 '15 at 16:04
This is the reason why you frequently see `[A-Za-z]` instead of `[A-z]`: to exclude all those extra characters. – Brian J Feb 11 '15 at 16:16
@OllieFord IMO it is this way because the earliest computers were scientific equipment and scientists wouldn't have been particularly bothered by it. Someone just defined it as such and it stayed this way. – Nelson Feb 11 '15 at 17:45
@OllieFord you can switch between upper- and lowercase with the fifth bit alone. – Quentin Feb 11 '15 at 19:21
@Quentin That's due to the 'symmetry' I mentioned, but is that really so useful that that's why? I suppose adding 26 is slightly more expensive, but it's surely not that common? – OJFord Feb 11 '15 at 19:33
@OllieFord that's the only thing I thought about, but I can't find any other argument. I don't know further. – Quentin Feb 11 '15 at 22:33
I think the reason is historical, but it's really interesting. @GrijeshChauhan it's simple Gnome editor :) – Maroun Feb 12 '15 at 11:52
@OllieFord we just lack 5 letters in the alphabet, or have 10 too many, and it's hard to change our alphabet for computer scientists' use only :-D – NeronLeVelu Feb 13 '15 at 14:26
@NeronLeVelu I'm not sure what you mean - I'm only suggesting the ordering in ASCII numbering didn't have to split upper and lower cases, not that the alphabet should be changed! – OJFord Feb 13 '15 at 14:29
@OllieFord the reason is that there are 26 letters in the alphabet, not a power of 2 minus 1 like 31 or 15, and they filled the gap with other elements – NeronLeVelu Feb 13 '15 at 15:01
@NeronLeVelu but only benefit of that is changing case with single bit as Quentin said, right? – OJFord Feb 13 '15 at 15:04
@OllieFord right, they took the first available slot that can contain the whole alphabet and used the next one for upper case, to give some logic (lower/upper) to a non-logical group of elements (the alphabet is conventional) in a power-of-2 world. – NeronLeVelu Feb 16 '15 at 06:35
@OllieFord The reason is that originally only capital letters were available, then the special chars `[, \, ], ^, _` were added. Later, when `[a-z]` came along, they were added after all of that, in order not to break the standards. If the table were built nowadays, `[a-z]` would likely come first, then `[A-Z]` etc... – Déjà vu Mar 11 '15 at 01:08
@ring0 Source? [RFC 20](https://tools.ietf.org/html/rfc20) 'ASCII Format for Network Interchange' includes both upper- and lower-case. The 5th-bit-flip seems most convincing, but it would be interesting if anyone has something that explicitly says this (or otherwise). – OJFord Mar 11 '15 at 02:03
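
The single-bit case switch discussed in these comments can be checked directly. This is only an illustrative Python sketch: `CASE_BIT` and `toggle_case` are hypothetical names, not from the thread. The bit in question has the value 0x20, which is the difference between `A` (0x41) and `a` (0x61).

```python
CASE_BIT = 0x20  # the single bit in which 'A' (0x41) and 'a' (0x61) differ


def toggle_case(ch: str) -> str:
    """Flip the case of one ASCII letter by XOR-ing a single bit."""
    if not (ch.isascii() and ch.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return chr(ord(ch) ^ CASE_BIT)


print(hex(ord("A")), hex(ord("a")))  # 0x41 0x61
print(toggle_case("A"), toggle_case("z"))  # a Z
```

This only works because the upper- and lowercase blocks start exactly 32 code points apart, which is also why six punctuation characters end up between `Z` and `a`.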

score 19 · Answer 2 · answered Feb 11 '15 at 08:48

A ===> 65
z ===> 122
[ ===> 91

So [ falls inside the range you have defined. Use [A-Za-z0-9]+ instead.
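
To see the difference on the string from the question, here is a small sketch using Python's `re` module. Python is an assumption; the same character classes behave the same way in most regex flavors.

```python
import re

text = "||a919238[.--a]asd|"  # test string from the question

# Code points that explain the behaviour.
print(ord("A"), ord("["), ord("z"))  # 65 91 122

# Original pattern: '[' (91) lies inside A-z, so it joins the first match.
original = re.search(r"[A-z0-9]+", text).group()
print(original)  # a919238[

# Explicit ranges leave out the punctuation between 'Z' and 'a'.
fixed = re.findall(r"[A-Za-z0-9]+", text)
print(fixed)  # ['a919238', 'a', 'asd']
```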

answered Feb 11 '15 at 08:48

vks


Great, I was also looking to remove the square bracket from the result, thanks! – BeNdErR Feb 11 '15 at 08:49
@BeNdErR If you can allow `_` then you can use `\w` .... – Grijesh Chauhan Feb 11 '15 at 15:29
@GrijeshChauhan `\w` and `[A-Za-z_]` are not equivalent in any Unicode-aware regex dialect. – Slade Feb 11 '15 at 17:59
@Slade hmm.. you are correct. [A question about this was also asked](http://stackoverflow.com/a/16621778/1673391) on SO in the past – Grijesh Chauhan Feb 11 '15 at 18:39
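
The point about Unicode-aware dialects can be illustrated with Python 3, where `\w` on `str` patterns is Unicode-aware by default. This is a sketch for one dialect only, and other engines differ. Note that `\w` also covers digits, so the ASCII comparison class used here is `[A-Za-z0-9_]`.

```python
import re

word = "\u00e9"  # 'é', a non-ASCII letter

# Unicode-aware \w accepts the accented letter...
unicode_w = bool(re.fullmatch(r"\w", word))
print(unicode_w)  # True

# ...while the explicit ASCII class does not.
explicit = bool(re.fullmatch(r"[A-Za-z0-9_]", word))
print(explicit)  # False

# re.ASCII restricts \w to [A-Za-z0-9_].
ascii_w = bool(re.fullmatch(r"\w", word, flags=re.ASCII))
print(ascii_w)  # False
```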

score 8 · Answer 3 · edited Feb 12 '15 at 05:28

You can use /[a-z0-9]+/i (the i makes it case-insensitive), or /[A-Za-z0-9]+/.
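
For comparison, a rough Python equivalent of both forms, where `re.IGNORECASE` plays the role of the `i` flag. The mixed-case test string is made up for illustration and is not from the question.

```python
import re

# Mixed-case variant of the question's string (hypothetical example input).
text = "||A919238[.--a]ASD|"

# Equivalent of /[a-z0-9]+/i
insensitive = re.findall(r"[a-z0-9]+", text, flags=re.IGNORECASE)
print(insensitive)  # ['A919238', 'a', 'ASD']

# Equivalent of /[A-Za-z0-9]+/
explicit = re.findall(r"[A-Za-z0-9]+", text)
print(explicit)  # ['A919238', 'a', 'ASD']
```

In Python 3 specifically, `re.IGNORECASE` combined with `[a-z]` on `str` patterns also matches a few non-ASCII letters such as U+017F; adding `re.ASCII` avoids that if strictly ASCII matching is needed.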

edited Feb 12 '15 at 05:28

doppelgreener


answered Feb 11 '15 at 08:55

Ahosan Karim Asik

