I'm learning regular expressions on my own. So far, I seem able to achieve everything I want using square brackets (i.e. []). In other people's code I often see parentheses used, and I'm wondering what some good use cases for parentheses are.

Can I have some examples?

Ali

2 Answers

This is a common error among regex beginners, and it's a serious one. Square brackets are used to create character classes, while parentheses create groups. Not only do these constructs serve different purposes, there is no overlap in their functions. In particular, square brackets are not used for grouping. Here are some examples to illustrate:

(abc) matches the sequence "abc"
[abc] matches one of the characters 'a', 'b', or 'c'.

(abc)+ matches abc one or more times ("abc", "abcabc", etc.)
[abc]+ matches one or more characters from the set {'a', 'b', 'c'} ("a", "cc", "baccbcaab", etc.)

(x+) matches at least one 'x' ("x", "xx", "xxxxxxxx", etc.)
[x+] matches 'x' or '+' (the letter 'x' or a literal plus sign - most regex metacharacters lose their special meanings inside character classes)

(a-z) matches the sequence "a-z" ('a', hyphen, 'z')
[a-z] matches any one character in the range a through z inclusive

(\d) matches a digit - \d is a shorthand for [0-9] (ASCII semantics) or \p{Nd} (Unicode semantics; "decimal digit")
[\d] matches a digit - unlike metacharacters, character-class shorthands retain their meanings inside "longhand" (or enumerated) character classes

(\d\d) matches two digits
[\d\d] matches one digit
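
The pairs above can be checked directly. Here is a minimal sketch using Python's `re` module (the choice of Python is an assumption; the answer itself is flavor-neutral, and the variable names are only illustrative). `re.fullmatch` succeeds only if the whole string matches the pattern.

```python
import re

# (abc)+ repeats the whole sequence; [abc]+ repeats single characters from the set.
seq_ok = re.fullmatch(r"(abc)+", "abcabc") is not None
seq_mixed = re.fullmatch(r"(abc)+", "baccbcaab") is not None
cls_mixed = re.fullmatch(r"[abc]+", "baccbcaab") is not None

# Inside a class, '+' is a literal plus sign and a-z is a range.
plus_literal = re.fullmatch(r"[x+]", "+") is not None
hyphen_sequence = re.fullmatch(r"(a-z)", "a-z") is not None
in_range = re.fullmatch(r"[a-z]", "q") is not None

# (\d\d) needs two digits; [\d\d] is a class, so it consumes only one character.
two_digits = re.fullmatch(r"(\d\d)", "42") is not None
class_on_two = re.fullmatch(r"[\d\d]", "42") is not None
class_on_one = re.fullmatch(r"[\d\d]", "4") is not None

print(seq_ok, seq_mixed, cls_mixed)            # True False True
print(plus_literal, hyphen_sequence, in_range)  # True True True
print(two_digits, class_on_two, class_on_one)   # True False True
```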

A character class is an atom: it consumes exactly one character, the same as a literal character like x or % does. But it allows you to define a set of characters, and it consumes the next character if it's a member of that set. (Specifying the same character more than once has no effect: [abracadabra] consumes one character from the set {'a', 'b', 'c', 'd', 'r'}.)

A group encloses one or more atoms, allowing them to be handled like a single atom:

  • abc? consumes an 'a', followed by a 'b', and the next character if it happens to be 'c'.
  • (abc)? consumes "abc" or nothing.
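
The difference between the two patterns above shows up in what each one consumes. A small sketch, again assuming Python's `re` module and made-up input strings:

```python
import re

optional_c = re.compile(r"abc?")        # only the 'c' is optional
optional_group = re.compile(r"(abc)?")  # the whole sequence "abc" is optional

# match() anchors at the start of the string; group(0) is the consumed text.
c_on_abx = optional_c.match("abx").group(0)
group_on_abx = optional_group.match("abx").group(0)
group_on_abcx = optional_group.match("abcx").group(0)

print(repr(c_on_abx))       # 'ab'
print(repr(group_on_abx))   # ''
print(repr(group_on_abcx))  # 'abc'
```

With the input "abx", the grouped version cannot match "abc" as a unit, so it matches the empty string instead of settling for "ab".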

And while there are many kinds of groups, serving different purposes, none of them is equivalent to a character class. You can use alternation inside a group to achieve similar results--for example, (a|b|c) will match the same thing as [abc]--but it's inherently less efficient, as well as less readable. In fact, it can easily lead to catastrophe, as this answer explains. If you have a choice between a character class and an alternation, you should always go with the character class. If you need to capture the character, wrap the class in parens: ([abc]).
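
As a rough illustration of the last paragraph, the sketch below (Python's `re` module assumed; the sample strings are invented) shows that the alternation and the character class find the same characters, and that wrapping the class in parens captures the matched character. It does not attempt to measure the efficiency difference.

```python
import re

text = "cabbage"

# With a single capturing group, findall returns what the group captured.
via_alternation = re.findall(r"(a|b|c)", text)
via_class = re.findall(r"[abc]", text)

# Parens around the class capture the one character it consumed.
captured = re.search(r"([abc])", "xyzb").group(1)

print(via_alternation)  # ['c', 'a', 'b', 'b', 'a']
print(via_class)        # ['c', 'a', 'b', 'b', 'a']
print(captured)         # b
```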

Alan Moore
  • Don't *all* regex metacharacters lose their meaning inside character classes, not just most of them? – Tim Pietzcker Apr 12 '12 at 09:13
  • Well, backslash is still the escape character, of course. And many references say `]` retains its meaning, though now I think about it, in most flavors it only *has* special meaning if it's part of a character class, same as `-`. So yes, I'd say "all" is just as valid as "most" in that statement, but it's more likely to provoke arguments. :D – Alan Moore Apr 12 '12 at 10:36
  • Is `([abc])` the same as `[abc]`? If so, is there any advantage to using the first over the 2nd? Does it not make it inefficient as you mentioned using alternation is more inefficient than just using square brackets? – Ali Apr 12 '12 at 11:28
  • The parens in that case are just to capture whatever the character class matches. For example, if you wanted to match two consecutive `'a'`s or `'b'`s (but not `"ab"` or `"ba"`), you could use `([ab])\1`. The `([ab])` matches an `'a'` or `'b'` and stores it in capture-group #1, and the `\1` matches another of whatever's in that group. – Alan Moore Apr 12 '12 at 13:08
  • As for the efficiency question, it was the alternation in `(a|b|c)` I was talking about, not the parens. But that one's pretty innocuous; what really kills you is when two or more alternatives can match the same characters, like in the example I pointed to where `.` and `\s` can both match the space character (among others). It may take a while to grok the reason for this; it did for me. :-/ In the meantime, if you find yourself able to use either a character class or an alternation and you don't see any reason to choose one over the other, trust me and go with the character class. – Alan Moore Apr 12 '12 at 13:31
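
The `([ab])\1` pattern from the comments can be tried out as follows. This is a sketch in Python (an assumption; backreference syntax is similar in most flavors), with illustrative test strings:

```python
import re

# ([ab]) captures one 'a' or 'b'; \1 then requires the same character again.
doubled = re.compile(r"([ab])\1")

found_aa = doubled.search("xaay").group(0)
found_bb = doubled.search("abba").group(0)
no_ab = doubled.search("ab")
no_ba = doubled.search("ba")

print(found_aa, found_bb)  # aa bb
print(no_ab, no_ba)        # None None
```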

Parentheses and brackets have totally different meanings in regexes.

Parentheses are used to group things, often so that the grouped text can be used later. For instance, (\w+) matches one or more word characters (letters, digits, or underscores) and saves the text for later. How to access it depends on your programming language.
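
As one example of accessing the saved text, here is how it might look in Python, where a match object exposes captured groups through `group()`. The greeting string is made up for illustration.

```python
import re

match = re.search(r"Hello, (\w+)", "Hello, world!")

whole_match = match.group(0)  # everything the pattern matched
first_group = match.group(1)  # only what (\w+) captured

print(whole_match)  # Hello, world
print(first_group)  # world
```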

Non-capturing groups are also possible (they start with (?:); however, they tend to be used much less often.
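
A short sketch of the difference, assuming Python's `re` module: a non-capturing group still groups for repetition, but it does not take up a slot in the captured groups.

```python
import re

capturing = re.search(r"(ab)+(\d)", "ababab7")
non_capturing = re.search(r"(?:ab)+(\d)", "ababab7")

# A repeated capturing group keeps only its last iteration.
capturing_groups = capturing.groups()
non_capturing_groups = non_capturing.groups()

print(capturing_groups)      # ('ab', '7')
print(non_capturing_groups)  # ('7',)
```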

Brackets denote a set of choices for a single character, e.g. [abc] matches any one character that is a, b, or c. [a-z] matches any one lowercase letter. [a-zA-Z0-9] matches any one lowercase letter, uppercase letter, or digit.

They can be used together. ^([a-z]|_)+$ matches a string that contains only lowercase letters and underscores. That would probably be better written as ^[a-z_]+$ though.
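
To check that the two patterns above accept the same strings, here is a small comparison in Python (assumed language; the sample strings are invented):

```python
import re

grouped = re.compile(r"^([a-z]|_)+$")
simple = re.compile(r"^[a-z_]+$")

samples = ["snake_case", "_private", "CamelCase", "with space", ""]

# For each sample, record whether each pattern matches.
results = [(bool(grouped.search(s)), bool(simple.search(s))) for s in samples]

for sample, (g, s) in zip(samples, results):
    print(repr(sample), g, s)
```

Both patterns accept the first two samples and reject the rest, including the empty string, since `+` requires at least one character.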

Peter C