How is [][] parsed in regex?

Question

Experimenting with simple regexes, I found some weird behavior.

A single pair of brackets, [], is treated either as an incomplete character class that throws an error (PCRE and Python) or as an empty character class (JS), which is not an error but doesn't match anything.

Going further, JS treats [][] as expected, as two empty classes, but in PCRE and Python the inner brackets ][ are interpreted as literals, even though they are not escaped.

Further experiments showed that three expressions are equivalent in practice:

   [][]
   [\]\[]
   [\[\]]

The second and the third make sense to me, but why does the first one work? Can someone please explain how exactly the [][] construction is parsed?

It will differ depending on what language you're using. For Python, the [documentation](https://docs.python.org/2/library/re.html#regular-expression-syntax) says "To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set". — BrenBarn, Feb 26 '16 at 22:30
How it's parsed in which engine? AFAIK the regexes different languages use aren't based off of a real standard, they're mostly ad-hoc derivatives of Perl. If there's no standard, this question is only answerable in the context of a particular language/engine. If you narrow the scope, someone might dive into the language's implementation or spec and find the rules responsible for this behaviour. This question's a little broad as-is. — Jeremy, Feb 26 '16 at 22:30
About pcre you can look at this post: http://stackoverflow.com/questions/17845014/what-does-the-regex-mean/17845034#17845034 — Casimir et Hippolyte, Feb 27 '16 at 00:50

score 3 · Answer 1 · answered Feb 27 '16 at 01:18

Chalk it up to excessive cleverness on the part of the JavaScript designers. They decided [] means nothing (an empty set, which can never match), and [^] means not nothing, in other words anything, including newlines. Most other flavors have a singleline/DOTALL mode that allows . to match newlines, but JavaScript had no such mode until the s flag arrived in ES2018. Instead it offered [^] as a sort of super-dot.

That didn't catch on, which is just as well. As you've observed, it's thoroughly incompatible with other flavors. Everyone else took the attitude that a closing bracket right after an opening bracket should be treated as a literal character. And, since character classes can't be nested (traditionally), the opening bracket never has special meaning inside one. Thus, [][] is simply a compact way to match a square bracket.
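As a minimal sketch of that rule, here is how Python's re module (one of the flavors that treats a leading ] as a literal) handles the three spellings from the question. The sample string and variable names are made up for illustration.

```python
import re

# Illustrative sample text, not taken from the question.
sample = "a[b]c"

# In Python's re, a ']' placed first in a set is a literal, and '[' is
# not special inside a set, so all three patterns describe the same set.
patterns = [r"[][]", r"[\]\[]", r"[\[\]]"]
results = {p: re.findall(p, sample) for p in patterns}

for pattern, found in results.items():
    print(pattern, found)

# A lone "[]" does not compile in Python: the ']' is taken as a literal
# member of the set, and the parser then runs out of input before it
# finds the bracket that would close the set.
try:
    re.compile(r"[]")
    lone_pair_error = None
except re.error as exc:
    lone_pair_error = exc

print("compiling '[]' failed:", lone_pair_error is not None)
```

Each of the three patterns should find both brackets in the sample, while the lone [] is rejected at compile time.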

Taking it further, if you want to match any character except ], [ or ^, in most flavors you can write it exactly like that: [^][^]. The closing bracket immediately after the negating ^ is treated as a literal, the opening bracket isn't special, and the second ^ is also treated as a literal. But in JavaScript, [^][^] is two separate atoms, each matching any character (including newlines). To get the same meaning as the other flavors, you have to escape the first closing bracket: [^\][^].
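A short sketch of the negated case, again using Python's re as a stand-in for the "most flavors" behavior. The sample string is an assumption chosen so that every excluded character appears once.

```python
import re

# Illustrative sample containing ']', '[' and '^' among ordinary letters.
sample = "a[b]^c"

# Compact form: ']' right after the negating '^' is a literal, '[' is not
# special, and the second '^' is a literal too.
compact = re.findall(r"[^][^]", sample)

# Explicit form with the closing bracket escaped, which is the spelling
# JavaScript needs in order to get the same meaning.
escaped = re.findall(r"[^\][^]", sample)

print(compact)
print(escaped)
```

Both calls should return only the letters a, b and c, since the brackets and the caret are excluded by the set.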

The pond gets even muddier when Java jumps in. It introduced a set intersection feature, so you can use, for example, [a-z&&[^aeiou]] to match consonants (the set of characters in the range a to z, intersected with the set of all characters that are not a, e, i, o or u). However, the [ doesn't have to be right after && to have special meaning; [[a-z]&&[^aeiou]] is the same as the previous regex.

That means, in Java you always have to escape an opening bracket with a backslash inside a character class, but you can still escape a closing bracket by placing it first. So the most compact way to match a square bracket in Java is []\[]. I find that confusing and ugly, so I often escape both brackets, at least in Java and JavaScript.

.NET has a similar feature called set subtraction that's much simpler and uses a tighter syntax: [a-z-[aeiou]]. The only place a nested class can appear is after -, and the whole construct must be at the end of the enclosing character class. You can still match a square bracket using [][] in .NET.
