Regular expression to return string split up respecting nested parentheses

Question

I know there are many answers on how to split a string while respecting parentheses, but none of them do so recursively. Take the string 1 2 3 (test 0, test 0) (test (0 test) 0):
Regex \s(?![^\(]*\)) returns "1", "2", "3", "(test 0, test 0)", "(test", "(0 test) 0)"
The regex I'm looking for would return either
"1", "2", "3", "(test 0, test 0)", "(test (0 test) 0)"
or
"1", "2", "3", "test 0, test 0", "test (0 test) 0"
which would let me recursively use it on the results again until no parentheses remain.
Ideally it would also respect escaped parentheses, but I am not that advanced with regex and only know the basics.
Does anyone have an idea how to approach this?

What makes you think `regex` is the right tool for this problem? — Scott Hunter, Dec 23 '21 at 19:43
When components of your string have semantic value, such as balanced parenthesis, you're better off tokenizing and parsing. Regular expressions can be a component of your lexer/tokenizer, but aren't optimal for doing the entire job. — DavidO, Dec 23 '21 at 19:43
Often it helps to choose a sustainable solution, when we know the context/background: Where do these string expressions formed of numbers and parentheses originate from? What do they represent? — hc_dev, Dec 23 '21 at 19:54

niko · Accepted Answer · 2021-12-25T07:36:27.317

Using only a regex for this task might work, but it wouldn't be straightforward.

Another possibility is writing a simple algorithm to track the parentheses in the string:

Split the string at every parenthesis, keeping the delimiters in the result (e.g. using re.split with a capturing group).
Keep two counters that track the parentheses: start_parens_count for ( and end_parens_count for ).
Using the counters, either split the current chunk at whitespace or append it to a temporary variable (term).
When the leftmost parenthesis has been closed, append term to the list of values and reset the counters and the temporary variable.

Here's an example:

import re

string = "1 2 3 (test 0, test 0) (test (0 test) 0)"


result, start_parens_count, end_parens_count, term = [], 0, 0, ""
for x in re.split(r"([()])", string):
    if not x.strip():
        continue
    elif x == "(":
        if start_parens_count > 0:
            term += "("
        start_parens_count += 1
    elif x == ")":
        end_parens_count += 1
        if end_parens_count == start_parens_count:
            result.append(term)
            end_parens_count, start_parens_count, term = 0, 0, ""
        else:
            term += ")"
    elif start_parens_count > end_parens_count:
        term += x
    else:
        result.extend(x.strip(" ").split(" "))


print(result)
# ['1', '2', '3', 'test 0, test 0', 'test (0 test) 0']

Not very elegant, but it works.
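
The question also asks about escaped parentheses and about applying the split recursively. Below is a minimal sketch of both, using a single depth counter rather than two counters. The function names split_top_level and parse_nested, and the choice of a backslash as the escape character, are assumptions made for illustration and are not part of the answer above:

```python
def split_top_level(text, escape="\\"):
    """Split on whitespace outside parentheses; keep groups with their parens.

    Assumption: a character preceded by `escape` is literal and never
    opens or closes a group. The escape character is kept in the token.
    """
    tokens, current, depth = [], [], 0

    def flush():
        # Emit the token collected so far, if any.
        if current:
            tokens.append("".join(current))
            current.clear()

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            current.append(text[i:i + 2])  # escape char plus escaped char
            i += 2
            continue
        if ch == "(":
            if depth == 0:
                flush()  # a top-level group always starts a new token
            depth += 1
            current.append(ch)
        elif ch == ")":
            if depth == 0:
                raise ValueError(f"unmatched ')' at index {i}")
            depth -= 1
            current.append(ch)
            if depth == 0:
                flush()  # the outermost group has just been closed
        elif ch.isspace() and depth == 0:
            flush()  # whitespace only separates tokens at the top level
        else:
            current.append(ch)
        i += 1
    if depth:
        raise ValueError("unmatched '('")
    flush()
    return tokens


def parse_nested(text):
    """Apply split_top_level recursively until no groups remain."""
    result = []
    for token in split_top_level(text):
        if token.startswith("("):
            result.append(parse_nested(token[1:-1]))  # drop the outer parens
        else:
            result.append(token)
    return result


string = "1 2 3 (test 0, test 0) (test (0 test) 0)"
print(split_top_level(string))
# ['1', '2', '3', '(test 0, test 0)', '(test (0 test) 0)']
print(parse_nested(string))
# ['1', '2', '3', ['test', '0,', 'test', '0'], ['test', ['0', 'test'], '0']]
```

Unlike the two-counter version, this sketch raises ValueError on unbalanced input instead of silently dropping the unfinished term; whether that is preferable depends on where the strings come from.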

The parsing algorithm is a solid approach and well explained together with expressive code. — hc_dev, Dec 26 '21 at 10:44

score 1 · Answer 2 · answered Dec 23 '21 at 21:43

You can install the third-party regex module (pip install regex) and use

import regex
text = "1 2 3 (test 0, test 0) (test (0 test) 0)"
matches = [match.group() for match in regex.finditer(r"(?:(\((?>[^()]+|(?1))*\))|\S)+", text)]
print(matches)
# => ['1', '2', '3', '(test 0, test 0)', '(test (0 test) 0)']

See the online Python demo. See the regex demo. The regex matches:

(?: - start of a non-capturing group:
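- (\((?>[^()]+|(?1))*\)) - capturing group 1, a balanced parenthesized substring: \( and \) match the outer parentheses, and the atomic group (?>...) matches either a run of non-parenthesis characters or, via (?1), group 1 itself recursively, which is what handles nesting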
| - or
- \S - any non-whitespace char
)+ - end of the group, repeat one or more times

Wiktor Stribiżew

Did you mean the last quantifier to be inside the non-capturing group? – oriberu Dec 23 '21 at 23:49
@oriberu No, but what you suggest would look like `(\((?>[^()]+|(?1))*\))|\S+`. – Wiktor Stribiżew Dec 24 '21 at 01:18
I was wondering, because quantifying the whole expression should not change your result, while quantifying the non-whitespace group allows it to match more than one character (presuming non-parenthesized sequences could be more than one character long, even if that wasn't in the test data). – oriberu Dec 24 '21 at 07:48

quasi-human · Answer 3 · 2022-02-06T15:06:41.347

Alternatively, you can use pyparsing as well.

import pyparsing as pp

pattern = pp.ZeroOrMore(pp.Regex(r'\S+') ^ pp.original_text_for(pp.nested_expr('(', ')')))

# Tests
string = '1 2 3 (test 0, test 0) (test (0 test) 0)'
result = pattern.parse_string(string).as_list()
answer = ['1', '2', '3', '(test 0, test 0)', '(test (0 test) 0)']
assert result == answer

string = ''
result = pattern.parse_string(string).as_list()
answer = []
assert result == answer

string = 'a'
result = pattern.parse_string(string).as_list()
answer = ['a']
assert result == answer

string = ' a (1) ! '
result = pattern.parse_string(string).as_list()
answer = ['a', '(1)', '!']
assert result == answer

string = ' a (b) cd (e f) g hi (j (k l) m) (o p (qr (s t) u v) w (x y) z)'
result = pattern.parse_string(string).as_list()
answer = ['a', '(b)', 'cd', '(e f)', 'g', 'hi', '(j (k l) m)', '(o p (qr (s t) u v) w (x y) z)']
assert result == answer

* pyparsing can be installed with pip install pyparsing

In addition, you can directly parse all the nested parentheses at once:

pattern = pp.ZeroOrMore(pp.Regex(r'\S+') ^ pp.nested_expr('(', ')'))

string = '1 2 3 (test 0, test 0) (test (0 test) 0)'
result = pattern.parse_string(string).as_list()
answer = ['1', '2', '3', ['test', '0,', 'test', '0'], ['test', ['0', 'test'], '0']]
assert result == answer

* Whitespace is a delimiter in this case.

Note:

If a pair of parentheses gets broken inside () (for example a(b(c), a(b)c), etc), you get an unexpected result or an IndexError is raised, so use this with care. (See: Python extract string in a phrase)
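
One way to guard against such input is to check that the parentheses are balanced before handing the string to the parser. A minimal sketch using only the standard library; the helper name is_balanced is hypothetical and not part of pyparsing:

```python
def is_balanced(text):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


print(is_balanced("(test (0 test) 0)"))  # True
print(is_balanced("a(b(c)"))             # False
print(is_balanced("a(b)c)"))             # False
```

Calling this first and only parsing when it returns True avoids passing broken input to the parser; it does not account for escaped parentheses.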
