Let's break it down:
[]-a-z]
 ^^ ^
 || +---- 3
 |+------ 2
 +------- 1
1 is a literal ], since it appears at the start of the character class, and [] is an invalid character class in PCRE.
The hyphen 2 is therefore the second character in the class, and introduces a range between ] and a.
The next hyphen, 3, is treated literally, because the previous token, a, is the end of the previous range. Another range cannot be introduced at this point. In PCRE, a - is treated literally if it's in a place where a range cannot be introduced, or if it's escaped. We usually place literal hyphens at the start or the end of the class to make it obvious, but this is not required.
Then, z is a simple literal.
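One way to check this reading is to enumerate what the class actually accepts. The sketch below uses Python's re module rather than PCRE itself, on the assumption that it parses this particular class the same way; the variable names are only illustrative.

```python
import re

# Assumption: Python's re module stands in for PCRE here. It is a different
# engine, but it parses this particular class the same way.
pattern = re.compile(r"[]-a-z]")

# Collect every printable ASCII character that the class accepts.
matched = [chr(c) for c in range(0x20, 0x7F) if pattern.fullmatch(chr(c))]

print(matched)
# ['-', ']', '^', '_', '`', 'a', 'z']
```

The range from ] to a covers code points 0x5D through 0x61, which is why ^, _ and ` show up, while b through y do not.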
PCRE follows the Perl syntax. This is documented like so:
About ]:
A ] is normally either the end of a POSIX character class (see POSIX Character Classes below), or it signals the end of the bracketed character class. If you want to include a ] in the set of characters, you must generally escape it.
However, if the ] is the first (or the second if the first character is a caret) character of a bracketed character class, it does not denote the end of the class (as you cannot have an empty class) and is considered part of the set of characters that can be matched without escaping.
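The three placements of ] described in that passage can be compared side by side. This is a minimal sketch, again assuming Python's re module as a stand-in for Perl/PCRE (it follows the same rule for a leading ]); the class contents and probe characters are arbitrary.

```python
import re

# Assumption: Python's re module is used as a stand-in for Perl/PCRE; it
# applies the same "leading ] is literal" rule.
leading = re.compile(r"[]x]")        # ] is the first character: literal
after_caret = re.compile(r"[^]x]")   # ] right after ^: literal, class negated
escaped = re.compile(r"[x\]]")       # ] elsewhere: needs a backslash

# Probe each class with ], x and an unrelated character y.
leading_ok = [ch for ch in "]xy" if leading.fullmatch(ch)]
after_caret_ok = [ch for ch in "]xy" if after_caret.fullmatch(ch)]
escaped_ok = [ch for ch in "]xy" if escaped.fullmatch(ch)]

print(leading_ok, after_caret_ok, escaped_ok)
# [']', 'x'] ['y'] [']', 'x']
```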
About hyphens:
If a hyphen in a character class cannot syntactically be part of a range, for instance because it is the first or the last character of the character class, or if it immediately follows a range, the hyphen isn't special, and so is considered a character to be matched literally. If you want a hyphen in your set of characters to be matched and its position in the class is such that it could be considered part of a range, you must escape that hyphen with a backslash.
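The hyphen positions named above (first, last, escaped) can be contrasted with a real range in a few lines. As before, this is only a sketch that assumes Python's re module behaves like Perl for these cases; the dictionary keys and probe string are made up for illustration.

```python
import re

# Assumption: Python's re module stands in for Perl here; it treats these
# hyphen positions the same way.
classes = {
    "first": r"[-az]",     # hyphen first: literal
    "last": r"[az-]",      # hyphen last: literal
    "escaped": r"[a\-z]",  # hyphen escaped: literal
    "range": r"[a-z]",     # hyphen between two characters: a range
}

# For each class, record which of the probe characters a, -, m, z it accepts.
accepted = {
    name: [ch for ch in "a-mz" if re.fullmatch(cls, ch)]
    for name, cls in classes.items()
}

for name, chars in accepted.items():
    print(name, chars)
```

Only the last class accepts m, and only the first three accept the hyphen itself.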
Note that this refers to Perl syntax. Other flavors may have different behavior. For instance, [] is a valid (empty) character class in JavaScript that cannot match anything.
The catch is that, depending on the options, PCRE could also interpret this the JS way (there are a couple of JS compatibility flags). From the PCRE2 docs:
An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special by default. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash. This means that, by default, an empty class cannot be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at the start does end the (empty) class.
The documented PCRE behavior for the hyphen, unsurprisingly, matches the Perl behavior:
The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class, or immediately after a range. For example, [b-d-z] matches letters in the range b to d, a hyphen character, or z.