This is a common error among regex beginners, and it's a serious one. Square brackets are used to create character classes, while parentheses create groups. Not only do these constructs serve different purposes, there is no overlap in their functions. In particular, square brackets are not used for grouping. Here are some examples to illustrate:
(abc)
matches the sequence "abc"
[abc]
matches one of the characters 'a', 'b', or 'c'.
(abc)+
matches abc one or more times ("abc", "abcabc", etc.)
[abc]+
matches one or more characters from the set {'a', 'b', 'c'} ("a", "cc", "baccbcaab", etc.)
(x+)
matches at least one 'x' ("x", "xx", "xxxxxxxx", etc.)
[x+]
matches 'x' or '+' (the letter 'x' or a literal plus sign - most regex metacharacters lose their special meanings inside character classes)
(a-z)
matches the sequence "a-z" ('a', hyphen, 'z')
[a-z]
matches any one character in the range a through z inclusive
(\d)
matches a digit - \d is a shorthand for [0-9] (ASCII semantics) or \p{Nd} (Unicode semantics; "decimal digit")
[\d]
matches a digit - unlike metacharacters, character-class shorthands retain their meanings inside "longhand" (or enumerated) character classes
(\d\d)
matches two digits
[\d\d]
matches one digit
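The examples above can be checked with a short script. This is a minimal sketch using Python's `re` module; the helper name `matches` is made up for illustration, and other regex flavors may differ in details such as what `\d` covers.

```python
import re

def matches(pattern, text):
    """Return True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

# Group: the sequence "abc" repeated as a unit.
print(matches(r"(abc)+", "abcabc"))  # True
print(matches(r"(abc)+", "cab"))     # False

# Class: any mix of the characters 'a', 'b', 'c'.
print(matches(r"[abc]+", "cab"))     # True

# '+' is a quantifier in a group but a literal inside a class.
print(matches(r"(x+)", "xx"))        # True
print(matches(r"[x+]", "+"))         # True
print(matches(r"[x+]", "xx"))        # False (a class consumes one character)

# '-' is a literal in a group but forms a range inside a class.
print(matches(r"(a-z)", "a-z"))      # True
print(matches(r"(a-z)", "m"))        # False
print(matches(r"[a-z]", "m"))        # True

# Repeating \d inside a class changes nothing; inside a group it does.
print(matches(r"(\d\d)", "42"))      # True
print(matches(r"[\d\d]", "42"))      # False
print(matches(r"[\d\d]", "4"))       # True
```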
A character class is an atom: it consumes exactly one character, the same as a literal character like x or % or ☺ does. But it allows you to define a set of characters, and it consumes the next character if it's a member of that set. (Specifying the same character more than once has no effect: [abracadabra] consumes one character from the set {'a', 'b', 'c', 'd', 'r'}.)
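To see that a class consumes exactly one character and that duplicates are ignored, here is a small sketch in Python's `re` module. The variable names are illustrative only, and the comparison is limited to ASCII characters for brevity.

```python
import re

# The class consumes a single character, no matter how many it lists.
m = re.match(r"[abracadabra]", "cab")
print(m.group())  # c
print(m.end())    # 1

# [abracadabra] and [abcdr] accept exactly the same ASCII characters.
same_set = all(
    bool(re.fullmatch(r"[abracadabra]", ch)) == bool(re.fullmatch(r"[abcdr]", ch))
    for ch in map(chr, range(128))
)
print(same_set)   # True
```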
A group encloses one or more atoms, allowing them to be handled like a single atom:
abc?
consumes an 'a', followed by a 'b', and the next character if it happens to be 'c'.
(abc)?
consumes "abc" or nothing.
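The difference between quantifying one atom and quantifying a group can be observed directly. The following is a sketch in Python's `re` module; `consumed` is a hypothetical helper that returns whatever the pattern consumes at the start of the input, or `None` if it cannot match there.

```python
import re

def consumed(pattern, text):
    """Return the text matched at the start of `text`, or None if no match."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

# In abc? the '?' applies only to 'c'; "ab" is still required.
print(repr(consumed(r"abc?", "abd")))     # 'ab'
print(repr(consumed(r"abc?", "abc")))     # 'abc'
print(repr(consumed(r"abc?", "a")))       # None

# In (abc)? the '?' applies to the whole group: all of "abc" or nothing.
print(repr(consumed(r"(abc)?", "abcd")))  # 'abc'
print(repr(consumed(r"(abc)?", "abd")))   # ''
```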
And while there are many kinds of groups, serving different purposes, none of them is equivalent to a character class. You can use alternation inside a group to achieve similar results--for example, (a|b|c) will match the same thing as [abc]--but it's inherently less efficient, as well as less readable. In fact, it can easily lead to catastrophe (catastrophic backtracking), as this answer explains. If you have a choice between a character class and an alternation, you should always go with the character class. If you need to capture the character, wrap the class in parens: ([abc]).
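As a quick check of the last two points, this sketch in Python's `re` module shows that the alternation and the class find the same matches, and that wrapping the class in parentheses captures the character. The sample strings are invented for illustration; the sketch does not measure the efficiency difference.

```python
import re

text = "a1b2c3"

# Single-character alternation and the character class find the same matches.
via_alternation = re.findall(r"(a|b|c)", text)
via_class = re.findall(r"[abc]", text)
print(via_alternation)  # ['a', 'b', 'c']
print(via_class)        # ['a', 'b', 'c']

# To capture the character, put the class inside a group.
m = re.search(r"([abc])\d", "xxb7")
print(m.group(1))       # b
```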