0

This is the regexp I created so far: \((.+?)\)

This is my test string: (2+2) + (2+3*(2+3))

The matches I get are:

(2+2) And (2+3*(2+3)

I want my matches to be:

(2+2) And (2+3*(2+3))

How should I modify my regular expression?

turnt
  • 3,235
  • 5
  • 23
  • 39
  • 5
    Regular expressions are not a good option for parsing recursive structures like this. – p.s.w.g Aug 22 '13 at 00:06
  • This is python, but it will point you in the direction of a workable (non-regex) approach: http://stackoverflow.com/questions/524548/regular-expression-to-detect-semi-colon-terminated-c-for-while-loops/524624#524624 – jwrush Aug 22 '13 at 00:23

2 Answers2

4

You cannot parse parentesized expressions with regular expression. There is a mathematical proof that regular expressions can't do this. Parenthesized expressions are a context-free grammar, and can thus be recognized by pushdown automata (stack-machines).

You can, anyway, define a regular expression that will work on any expression with less than N parentheses, with an arbitrary finite N (even though the expression will get complex). You just need to acknowledge that your parentheses might contain another arbitrary number of parenteses.

\(([^()]+(\([^)]+\)[^)]*)*)\)

It works like this:

  1. \(([^()]+ matches an open parenthesis, follwed by whatever is not a parenthesis;
  2. (\([^)]+\)[^)]*)* optionally, there may be another group, formed by an open parenthesis, with something inside it, followed by a matching closing parenthesis. Some other non-parenthesis character may follow. This can be repeated an arbitrary amount of times. Anyway, at last, there must be
  3. )\) another closed parenthesis, which matches with the first one.

This should work for nesting depth 2. If you want nesting depth 3, you have to further recurse, allowing each of the groups I described at point (2) to have a nested parenthesized group.

Things will get much easier if you use a stack. Such as:

foundMatches = [];
mStack = [];
start = RegExp("\\(");
mid = RegExp("[^()]*[()]?");
idx = 0;
while ((idx = input.search(start.substr(idx))) != -1) {
    mStack.push(idx);
    //Start a search
    nidx = input.substr(idx + 1).search(mid);
    while (nidx != -1 && idx + nidx < input.length) {
        idx += nidx;
        match = input.substr(idx).match(mid);
        match = match[0].substr(-1);
        if (match == "(") {
            mStack.push(idx);
        } else if (mStack.length == 1) {
            break;
        }
        nidx = input.substr(idx + 1).search(mid);
    }
    //Check the result
    if (nidx != -1 && idx + nidx < input.length) {
        //idx+nidx is the index of the last ")"
        idx += nidx;
        //The stack contains the index of the first "("
        startIdx = mStack.pop();
        foundMatches.push(input.substr(startIdx, idx + 1 - startIdx));
    }
    idx += 1;
}
Giulio Franco
  • 3,170
  • 15
  • 18
  • Of course, the stack machine is not worth it if you don't plan to parse expression with more than 2-3 nested parentheses – Giulio Franco Aug 22 '13 at 00:58
1

How about you parse it yourself using a loop without the help of regex? Here is one simple way:

  • You would have to have a variable, say "level", which keeps track of how many open parentheses you have come across so far (initialize it with a 0).
  • You would also need a string buffer to contain each of your matches ( e.g. (2+2) or (2+3 * (2+3)) ) .
  • Finally, you would need somewhere you can dump the contents of your buffer into whenever you finish reading a match.

  • As you read the string character by character, you would increment level by 1 when you come across "(", and decrement by 1 when you come across ")". You would then put the character into the buffer.

  • When you come across ")" AND the level happens to hit 0, that is when you know you have a match. This is when you would dump the contents of the buffer and continue.

This method assumes that whenever you have a "(" there will always be a corresponding ")" in the input string. This method will handle arbitrary number of parentheses.

Joohwan
  • 2,374
  • 1
  • 19
  • 30