After tinkering in regex101 for a few minutes, I realized that ] does not need to be escaped if it immediately follows [.

In regex101, the pattern []-a-z] is described as

/[]-a-z]/
[]-a-z] match a single character present in the list below
    ]-a a single character in the range between ] and a (case sensitive)
    -z a single character in the list -z literally (case sensitive)

But I always thought that if - has to be matched literally without being escaped, it should go either at the beginning or at the end.

Then why is my pattern not recognized as an error? Why does -z match a single character in the list -z literally?

Wiktor Stribiżew
MAKZ

3 Answers

Let's break it down:

[]-a-z]
 ^^ ^
 || +---- 3
 |+------ 2
 +------- 1

1 is a literal ] since it appears at the start of the character class, and an empty class [] is invalid in PCRE.

Hyphen 2 is therefore the second character in the class, and introduces a range between ] and a.

The next hyphen, 3, is treated literally, because the previous token, a, is the end of the previous range. Another range cannot be introduced at this point. In PCRE, a - is treated literally if it's in a place where a range cannot be introduced or if it's escaped. We usually place literal hyphens at the start or the end of the class to make it obvious, but this is not required.

Then, z is a simple literal.
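
To make the breakdown concrete, here is a minimal sketch that lists every printable ASCII character the class accepts. It uses Python's `re` module as a stand-in, on the assumption that it parses this particular class the same way PCRE does (it is a different engine, so this is only an illustration). The helper name `matched_ascii_chars` is made up for this example.

```python
import re

# The pattern from the question. Python's `re` is used here only as a
# convenient stand-in for PCRE; it is a different engine.
PATTERN = re.compile(r"[]-a-z]")


def matched_ascii_chars(pattern):
    """Return every printable ASCII character that the pattern matches alone."""
    return [chr(code) for code in range(32, 127) if pattern.fullmatch(chr(code))]


# The range ]-a contributes ] ^ _ ` a, and the trailing -z contributes - and z.
print(matched_ascii_chars(PATTERN))
```

The result contains the five characters of the range from ] to a, plus a literal hyphen and a literal z, which is consistent with the three-part reading above.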

PCRE follows the Perl syntax. This is documented like so:

About ]:

A ] is normally either the end of a POSIX character class (see POSIX Character Classes below), or it signals the end of the bracketed character class. If you want to include a ] in the set of characters, you must generally escape it.
However, if the ] is the first (or the second if the first character is a caret) character of a bracketed character class, it does not denote the end of the class (as you cannot have an empty class) and is considered part of the set of characters that can be matched without escaping.

About hyphens:

If a hyphen in a character class cannot syntactically be part of a range, for instance because it is the first or the last character of the character class, or if it immediately follows a range, the hyphen isn't special, and so is considered a character to be matched literally. If you want a hyphen in your set of characters to be matched and its position in the class is such that it could be considered part of a range, you must escape that hyphen with a backslash.

Note that this refers to Perl syntax. Other flavors may have different behavior. For instance, [] is a valid (empty) character class in JavaScript that cannot match anything.

The catch is that, depending on the options, PCRE could also interpret this in the JS way (there's a couple of JS compatibility flags). From the PCRE2 docs:

An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special by default. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash. This means that, by default, an empty class cannot be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at the start does end the (empty) class.

The documented PCRE behavior about the hyphen is, unsurprisingly, matching the Perl behavior:

The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class, or immediately after a range. For example, [b-d-z] matches letters in the range b to d, a hyphen character, or z.
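
The documented [b-d-z] example can be checked with a short sketch. Again this uses Python's `re` rather than PCRE itself, assuming the two agree on these simple classes; `class_members` is a hypothetical helper written for this illustration.

```python
import re


def class_members(char_class):
    """Return the set of printable ASCII characters matched by a character class."""
    return {chr(code) for code in range(32, 127) if re.fullmatch(char_class, chr(code))}


# Hyphen right after a range: it cannot start another range, so it is literal.
after_range = class_members(r"[b-d-z]")

# Escaped hyphen between two characters: literal, no range is formed.
escaped = class_members(r"[b\-z]")

# Unescaped hyphen between two characters: a range from b to z, no literal hyphen.
unescaped = class_members(r"[b-z]")

print(sorted(after_range))
print(sorted(escaped))
print("-" in unescaped, len(unescaped))
```

The contrast between the last two classes shows why escaping matters only when the hyphen sits where a range could otherwise be formed.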

Lucas Trzesniewski
  • It helped, thanks. One more thing: is this behaviour documented? I mean, will this pattern always be fail-proof? – MAKZ Apr 05 '15 at 15:19
  • Note: in POSIX BRE, this is an invalid Bracket expression. Hyphens need to stand first or last if they are not part of a range, and in combination with `]` it has to go last. See [Posix Regular Expressions: 9.3.5 point 7](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04) – kvantour Oct 20 '21 at 09:09
The regex does not fail because the first - denotes a range here, from ] to a. ] does not have to be escaped at the starting position inside the character class, so it is treated as a literal here. The range is valid because ] has ASCII code 93 and a has ASCII code 97, so its endpoints are in ascending order.
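
The ordering argument can be illustrated with a small sketch. It relies on Python's `re` module, which is assumed here to reject descending ranges in the same spirit as PCRE; the helper `compiles` is hypothetical and exists only for this example.

```python
import re

# Code points of the two range endpoints.
print(ord("]"), ord("a"))


def compiles(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Ascending range from ] (93) to a (97): accepted.
ascending = compiles(r"[]-a]")

# Descending range from a (97) to ] (93): rejected as a bad character range.
descending = compiles(r"[a-\]]")

print(ascending, descending)
```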

EDIT:

There is one thing that is universal about regexes: they are analyzed from left to right. Thus, the range is formed using the characters on either side of the first hyphen. The 2nd hyphen comes right after the range's end character, which cannot also serve as the start of a new range because it is already "occupied". Thus, the regex engine has no choice but to parse the 2nd hyphen as a literal.

See PCRE Reference:

The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class, or immediately after a range. For example, [b-d-z] matches letters in the range b to d, a hyphen character, or z.

Wiktor Stribiżew
  • recheck the question please. – MAKZ Apr 05 '15 at 15:07
  • I understand that `]-a` is a valid range. But I have doubts regarding `-z` not being treated as an error. – MAKZ Apr 05 '15 at 15:09
  • @MAKZ: There is one thing that is universal about regexes: they are analyzed from left to right. Thus, the range is formed using the first characters around the first hyphen. The 2nd hyphen goes right after the range end character, and it cannot be used as a starting range character as it is "occupied". Thus, the regex engine can't but parse the 2nd hyphen as a literal. – Wiktor Stribiżew Apr 05 '15 at 15:13
  • @MAKZ: Please see updated reference from PCRE documentation that I added to the answer. – Wiktor Stribiżew Apr 05 '15 at 15:23
  • Nice. I randomly selected an answer. But you earned an upvote. – MAKZ Apr 05 '15 at 15:57
Regex Info:

Hyphens at other positions in character classes where they can't form a range may be interpreted as literals or as errors. Regex flavors are quite inconsistent about this.

So here the second - can't form a range, because the token before it is the end of a range rather than a standalone character, and hence it's interpreted as a literal -.

Amit Joki
  • I am talking about the second hyphen – MAKZ Apr 05 '15 at 15:07
  • I understand that ]-a is a valid range. But I have doubts regarding -z not being treated as an error. – MAKZ Apr 05 '15 at 15:09
  • your updated answer is logically correct, of course. But isn't it necessary to either escape the hyphen or send it to either end, according to rule ? – MAKZ Apr 05 '15 at 15:12
  • @MAKZ there's no rule kind of thing. If it was in the middle it will form a range. And for a range to form, it needs to be between a single character to a single character. It works fine. – Amit Joki Apr 05 '15 at 15:14