1

This regular expression

/\(.*\)/

won't match the matching parenthesis but the last parenthesis in the string. Is there a regular expression extension, or something similar, with a proper syntax that allows for this? For example:

there are (many (things (on) the)) box (except (carrots (and apples)))

/OPEN(.*CLOSE)/ should match (many (things (on) the))

There could be infinite levels of parentheses.

Mike Samuel
  • 118,113
  • 30
  • 216
  • 245
MaiaVictor
  • 51,090
  • 44
  • 144
  • 286

3 Answers3

7

If you only have one level of parentheses, then there are two possibilities.

Option 1: use ungreedy repetition:

/\(.*?\)/

This will stop when it encounters the first ).

Option 2: use a negative character class

/\([^)]*\)/

This can only repeat characters that are not ), so it can necessarily never go past the first closing parenthesis. This option is usually preferred due to performance reasons. In addition, this option is more easily extended to allow for escaping parenthesis (so that you could match this complete string: (some\)thing) instead of throwing away thing)). But this is probably rather rarely necessary.

However if you want nested structures, this is generally too complicated for regex (although some flavors like PCRE support recursive patterns). In this case, you should just go through the string yourself and count parentheses, to keep track of your current nesting level.

Just as a side note about those recursive patterns: In PCRE (?R) simply represents the whole pattern, so inserting this somewhere makes the whole thing recursive. But then every content of parentheses must be of the same structure as the whole match. Also, it is not really possible to do meaningful one-step replacements with this, as well as using capturing groups on multiple nested levels. All in all - you are best off, not to use regular expressions for nested structures.

Update: Since you seem eager to find a regex solution, here is how you would match your example using PCRE (example implementation in PHP):

$str = 'there are (many (things (on) the)) box (except (carrots (and apples)))';
preg_match_all('/\([^()]*(?:(?R)[^()]*)*\)/', $str, $matches);
print_r($matches);

results in

Array
(
    [0] => Array
        (
            [0] => (many (things (on) the))
            [1] => (except (carrots (and apples)))
        )   
)

What the pattern does:

\(      # opening bracket
[^()]*  # arbitrarily many non-bracket characters
(?:     # start a non-capturing group for later repetition
(?R)    # recursion! (match any nested brackets)
[^()]*  # arbitrarily many non-bracket characters
)*      # close the group and repeat it arbitrarily many times
\)      # closing bracket

This allows for infinite nested levels and also for infinite parallel levels.

Note that it is not possible to get all nested levels as separate captured groups. You will always just get the inner-most or outer-most group. Also, doing a recursive replacement is not possible like this.

Martin Ender
  • 43,427
  • 11
  • 90
  • 130
  • Wow. _Unrolling-the-Recursive-Loop!_ I just learned something new - thanks. Big +1. (I guess you _have_ read [MRE3](http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124 "Mastering Regular Expressions (3rd Edition)")!) – ridgerunner Oct 28 '12 at 21:05
  • @ridgerunner, no I have not ;). Just applying PHP's PCRE documentation :D. However, I don't know what you are referring to here. Doesn't your answer work basically the same? – Martin Ender Oct 28 '12 at 21:10
  • Yes they are functionally equivalent. But the `{normal* (special normal*)*}` structure you have implemented here is a perfect example of Friedl's advanced _"Unrolling-the-Loop"_ efficiency technique. (You've effectively taken my expression and made it go faster by eliminating the unnecessary alternation (which is comparatively slow)). (Note that [the Slashdot review of MRE](http://slashdot.org/story/06/09/13/147213/mastering-regular-expressions) rated it 11 out of 10!) – ridgerunner Oct 28 '12 at 21:20
  • @ridgerunner Ah okay. That was lucky then, because it was simply the first implementation that came to my mind (and I actually prefer yours as far as readability goes). But yeah, now that you mention it makes sense that the alternation is less efficient (especially if you forget to make that repetition possessive). – Martin Ender Oct 28 '12 at 21:26
  • @ridgerunner I've finally read it! He does in fact unroll the recursive loop himself on page 477 ;) – Martin Ender Jun 28 '13 at 15:30
  • _Excellent!_ Glad to hear you read it. Keep it around - some of the subtler stuff takes several readings before it finally sinks in. – ridgerunner Jun 29 '13 at 14:45
2

Regular expressions are not powerful enough to find matching parentheses, because parentheses are nested structures. There exists a simple algorithm to find matching parentheses, though, which is described in this answer.

If you are just trying to find the first right parenthesis in an expression, you should use a non-greedy matcher in your regex. In this case, the non-greedy version of your regex is the following:

/\(.*?\)/
Community
  • 1
  • 1
Jordan Lewis
  • 16,900
  • 4
  • 29
  • 46
  • There is a simple algorithm for finding the solution, but I'm insterested in using them inside something similar to Regex. – MaiaVictor Oct 28 '12 at 20:19
  • Unfortunately, regexes aren't powerful enough to deal with nested structures in the way that you're talking about. Regular expressions correspond to a kind of machine called a deterministic finite automaton, which has a fundamental limitation: it can't remember any 'extra' state like, in your case, how many levels deep of parentheses you are. There is another kind of automaton called a push-down automaton that has the facility that you need, but I don't believe there is a simple language like regular expression for push-down automata. – Jordan Lewis Oct 28 '12 at 20:26
  • @Dokkat, while I agree with Jordan in all points, check out my updated answer, it includes a regex solution that works in some engines. – Martin Ender Oct 28 '12 at 20:48
  • Regular expressions are certainly powerful enough to match the _innermost_ matched sets of parentheses with `/\([^()]*\)/`. And those regex engines having (non-REGULAR) recursive patterns (i.e. Perl, PHP/PCRE, .NET) can easily match the outermost matched pairs as well. – ridgerunner Oct 28 '12 at 20:55
1

Given a string containing nested matching parentheses, you can either match the innermost sets with this (non-recursive JavaScript) regex:

var re = /\([^()]*\)/g;

Or you can match the outermost sets with this (recursive PHP) regex:

$re = '/\((?:[^()]++|(?R))*\)/';

But you cannot easily match sets of matching parentheses that are in-between the innermost and outermost.

Note also that the (naive and frequently encountered) expression: /\(.*?\)/ will always match incorrectly (neither the innermost nor outermost matched sets).

ridgerunner
  • 33,777
  • 5
  • 57
  • 69