2

With Javascript, suppose I have a string like (1)(((2)(3))4), can I get a regex to match just (1) and (((2)(3))4), or do I need to do something more complicated?

Ideally the regex would return ["((2)(3))","4"] if you searched ((2)(3))4. Actually that's really a requirement. The point is to group things into the chunks that need to be worked on first, like the way parentheses work in math.

Moss
  • 3,695
  • 6
  • 40
  • 60
  • No, unless you know the input's nesting level is restricted to a certain level of nesting. Practically: **NO**. http://stackoverflow.com/a/1732454/1073695 – Jo So Aug 15 '14 at 00:22
  • What about the recursiveSearch addon here http://xregexp.com/plugins/? It looks like it is almost what I need but how would I make it match things that aren't in parentheses? – Moss Aug 15 '14 at 00:52
  • Know what you talk about. As stated in my answer, true regexes can't handle recursive structures. The fancy extensions like those including backtracking are not technically regexes anymore. Furthermore, they can't be implemented efficiently, and such implementations are usually also very slow when given a pure regex. Highly recommended: http://swtch.com/~rsc/regexp/regexp1.html – Jo So Aug 15 '14 at 01:22

3 Answers3

2

No, there is no way to match only top level parentheses with regex

Looking only at the top level doesn't make the problem easier than general "parsing" of recursive structures. (See this relevant popular SO question with a great answer).

Here's a simple intuitive reason why Regex can't parse arbitrary levels of nesting:

To keep track of the level of nesting, one must count. If one wants to be able to keep track of an arbitrary level of nesting, one needs an arbitrarily large number while running the program.

But regular expressions are exactly those that can be implemented by DFAs, that is Deterministice finite automatons. These have only a finite number of states. Thus they can't keep track of an arbitrarily large number.

This argument works also for your specific concern of being only interested in the top level parentheses.

To recognize the top level parentheses, you must keep track of arbitrary nesting preceding any one of them:

((((..arbitrarily deep nesting...))))((.....)).......()......
^toplevel                           ^^       ^       ^^

So yes, you need something more powerful than regex.


While if you are very pragmatic, for your concrete application it might be okay to say that you won't encounter any nesting deeper than, say, 1000 (and so you might be willing to go with regex), it's also a very practical fact that any regex recognizing a nesting level of more than 2 is basically unreadable.

Community
  • 1
  • 1
Jo So
  • 25,005
  • 6
  • 42
  • 59
  • Perhaps you could suggest a way to do it without regex or with regex + other stuff. – Moss Aug 15 '14 at 00:45
  • 2
    Just write a few lines of code. Initialize a counter `nesting_level` to `0`. Go through your input string from left to right, increasing the counter whenever you encounter `(` and decreasing it whenever you encounter `)`. Those `(` that you encounter while `nesting_level` was set to `0`, and those `)` that you encounter while `nesting_level` was set to `1` (i.e those that effect decreasing the counter to 0), are the top level parentheses – Jo So Aug 15 '14 at 01:40
2

Well, here is one way to do it. As Jo So pointed out, you can't really do it in javascript with indefinite amounts of recursion, but you can make something arbitrarily recursive pretty easily. I'm not sure how the performance scales though.

First I figured out that you need recursion. Then I realized that you can just make your regex 'recursive' by just copying and pasting recursively, like so (using curly braces for clarity):

Starting regex

Finds stuff in brackets that isn't itself brackets.

/{([^{}])*}/g

Then copy and paste the whole regex inside itself! (I spaced it out so you can see where it was pasted in.) So now it is basically like a( x | a( x )b )b

/{([^{}] | {([^{}])*} )*}/g

That will get you one level of recursion and you can continue ad nauseum in this fashion and actually double the amount of recursions each time:

//matches {4{3{2{1}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*}/g

//matches {8{7{6{5{4{3{2{1}}}}}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*})*})*})*})*}/g

Finally I just add |[^{}]+ on the end of the expression to match stuff that is completely outside of brackets. Crazy, but it works for my needs. I feel like there is probably some clever way to combine this concept with a recursive function in order to get a truly recursive matcher, but I can't think of it now.

Moss
  • 3,695
  • 6
  • 40
  • 60
  • 1
    If recursion is needed, regex should be avoided. – scrowler Aug 15 '14 at 02:00
  • Is this really bad for performance? (with say 3 or 4 levels?) – Moss Aug 15 '14 at 02:07
  • I'm just a regex hater. I use it when I have to, but I'm totally of the opinion that it [should be avoided](http://programmers.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions) wherever possible. Just me (and a few others...). – scrowler Aug 15 '14 at 02:23
  • Nitpick: "infinite" or "arbitrary amounts of" recursion do not exist. Recursion is by definition an "infinite" concept. If something is "limited", even "arbitrarily limited", it is not recursive. – Jo So Aug 15 '14 at 11:44
  • This answer illustrates what I meant by `any regex recognizing a nesting level of more than 2 is basically unreadable`. – Jo So Aug 15 '14 at 11:45
  • It doesn't matter so much that it is unreadable, as long as it works. But in the end I realized my requirements were more complicated than what I asked about here, so I ended up with a different solution and did away with parentheses altogether. – Moss Aug 15 '14 at 17:55
0

If you can be sure that the parentheses are balanced (I'm sure there are other resources out there that can answer that question for you if required) and if by "top-level" you're happy to find local as well as global maxima then all you need to do is find any content that starts with an open bracket and closes with a close-bracket, with no intermediate open-bracket between the two:

I think the following should do that for you and helpfully group any "top-level" content:

\(([^\(]*?)\)

That content may not all be at the same "level", but if you think of the nested brackets as describing the branching of a tree, the regex will return to you the leaves. If you pre-process your text to be wrapped in parentheses to start with, and the earlier assumptions are met, you can guarantee always getting at least one "leaf".

Thomas Kimber
  • 10,601
  • 3
  • 25
  • 42
  • That is actually an answer to the opposite of what the question was asking. This finds only the bottom level parentheses. – Moss Jan 11 '19 at 00:50
  • I guess it depends on where you start counting from - but given the stated assumptions - this pulls out those parts of the expression that should be performed first. I.e. the most deeply nested expressions. – Thomas Kimber Jan 11 '19 at 00:54
  • Regardless of which way you count, it is the opposite of what I asked for. But it may be useful for people with a similar question to consider whether they need to break apart the largest chunks first and work top down, or if they can do as you suggest and work bottom up. – Moss Jan 13 '19 at 02:22