I know many answers exist to the question of how to split up a string while respecting parentheses, but they never do so recursively.
Looking at the string `1 2 3 (test 0, test 0) (test (0 test) 0)`, the regex `\s(?![^\(]*\))`
returns "1", "2", "3", "(test 0, test 0)", "(test", "(0 test) 0)"
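For reference, the behaviour described above can be reproduced with Python's standard `re` module. This is only a sketch: the question does not say how the pattern is applied, so using it with `re.split` is an assumption, and the variable names are illustrative.

```python
import re

text = "1 2 3 (test 0, test 0) (test (0 test) 0)"

# Split on whitespace that is NOT followed by a ")" before the next "(".
# The lookahead cannot count nesting depth, so the space right after the
# outer "(test" (which is followed directly by "(") is treated as a split point.
parts = re.split(r"\s(?![^\(]*\))", text)

print(parts)
# ['1', '2', '3', '(test 0, test 0)', '(test', '(0 test) 0)']
```

The lookahead only checks whether a closing parenthesis can be reached without crossing an opening one, which is why it breaks as soon as groups are nested.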
The regex I'm looking for would return either
"1", "2", "3", "(test 0, test 0)", "(test (0 test) 0)"
or
"1", "2", "3", "test 0, test 0", "test (0 test) 0"
which would let me recursively use it on the results again until no parentheses remain.
Ideally it would also respect escaped parentheses, but I only know the basics of regex and am not advanced enough to write this myself.
Does anyone have an idea how to approach this?
CodeSpoof
- What makes you think `regex` is the right tool for this problem? – Scott Hunter Dec 23 '21 at 19:43
- When components of your string have semantic value, such as balanced parentheses, you're better off tokenizing and parsing. Regular expressions can be a component of your lexer/tokenizer, but aren't optimal for doing the entire job. – DavidO Dec 23 '21 at 19:43
- Often it helps to choose a sustainable solution when we know the context/background: Where do these string expressions formed of numbers and parentheses originate from? What do they represent? – hc_dev Dec 23 '21 at 19:54
3 Answers
Using only `regex` for the task might work, but it wouldn't be straightforward.
Another possibility is writing a simple algorithm to track the parentheses in the string:
- Split the string at all parentheses, while returning the delimiter (e.g. using `re.split`).
- Keep counters tracking the parentheses: `start_parens_count` for `(` and `end_parens_count` for `)`.
- Using the counters, proceed by either splitting at whitespace or adding the current data into a temp var (`term`).
- When the leftmost parenthesis has been closed, append `term` to the list of values and reset the counters/temp vars.
Here's an example:
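import re

string = "1 2 3 (test 0, test 0) (test (0 test) 0)"
result, start_parens_count, end_parens_count, term = [], 0, 0, ""

# The capturing group makes re.split keep "(" and ")" as separate items.
for x in re.split(r"([()])", string):
    if not x.strip():
        # Skip empty or whitespace-only pieces.
        continue
    elif x == "(":
        # A nested "(" belongs to the current term; the outermost one is dropped.
        if start_parens_count > 0:
            term += "("
        start_parens_count += 1
    elif x == ")":
        end_parens_count += 1
        if end_parens_count == start_parens_count:
            # The outermost group is closed: store the term and reset.
            result.append(term)
            end_parens_count, start_parens_count, term = 0, 0, ""
        else:
            term += ")"
    elif start_parens_count > end_parens_count:
        # Inside parentheses: collect the text unchanged.
        term += x
    else:
        # Outside parentheses: split on spaces.
        result.extend(x.strip(" ").split(" "))

print(result)
# ['1', '2', '3', 'test 0, test 0', 'test (0 test) 0']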
Not very elegant, but works.
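The same counting idea can also be applied character by character, which makes it straightforward to honour escaped parentheses as the question asks. The following is a minimal sketch and not part of the code above: the function name `split_top_level` is hypothetical, it assumes that a backslash escapes the following character, and it keeps the outer parentheses of each group.

```python
def split_top_level(text):
    """Split on whitespace outside parentheses (hypothetical helper).

    Assumption: a backslash escapes the next character, so "\\(" and "\\)"
    do not change the nesting depth.
    """
    parts, current, depth = [], [], 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Copy the escape sequence verbatim and skip depth tracking.
            current.append(text[i:i + 2])
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unbalanced ')' at index {i}")
        if ch.isspace() and depth == 0:
            # Whitespace at depth 0 ends the current piece, if any.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
        i += 1
    if depth != 0:
        raise ValueError("unbalanced '('")
    if current:
        parts.append("".join(current))
    return parts


print(split_top_level("1 2 3 (test 0, test 0) (test (0 test) 0)"))
# ['1', '2', '3', '(test 0, test 0)', '(test (0 test) 0)']
```

To go one level deeper, a piece that is a single parenthesized group can have its outer parentheses stripped (`piece[1:-1]`) and be passed to the function again, repeating until no parentheses remain.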

niko
- The parsing algorithm is a solid approach and well explained together with expressive code. – hc_dev Dec 26 '21 at 10:44
You can install the third-party `regex` module, which supports recursion (`pip install regex`), and use
import regex
text = "1 2 3 (test 0, test 0) (test (0 test) 0)"
matches = [match.group() for match in regex.finditer(r"(?:(\((?>[^()]+|(?1))*\))|\S)+", text)]
print(matches)
# => ['1', '2', '3', '(test 0, test 0)', '(test (0 test) 0)']
See the online Python demo. See the regex demo. The regex matches:
- `(?:` - start of a non-capturing group:
  - `(\((?>[^()]+|(?1))*\))` - a text between any nested parentheses
  - `|` - or
  - `\S` - any non-whitespace char
- `)+` - end of the group, repeat one or more times

Wiktor Stribiżew
- Did you mean the last quantifier to be inside the non-capturing group? – oriberu Dec 23 '21 at 23:49
- @oriberu No, but what you suggest would look like `(\((?>[^()]+|(?1))*\))|\S+`. – Wiktor Stribiżew Dec 24 '21 at 01:18
- I was wondering, because quantifying the whole expression should not change your result, while quantifying the non-whitespace group allows it to match more than one character (presuming non-parenthesized sequences could be more than one character long, even if that wasn't in the test data). – oriberu Dec 24 '21 at 07:48
Alternatively, you can use pyparsing as well.
import pyparsing as pp
pattern = pp.ZeroOrMore(pp.Regex(r'\S+') ^ pp.original_text_for(pp.nested_expr('(', ')')))
# Tests
string = '1 2 3 (test 0, test 0) (test (0 test) 0)'
result = pattern.parse_string(string).as_list()
answer = ['1', '2', '3', '(test 0, test 0)', '(test (0 test) 0)']
assert result == answer
string = ''
result = pattern.parse_string(string).as_list()
answer = []
assert result == answer
string = 'a'
result = pattern.parse_string(string).as_list()
answer = ['a']
assert result == answer
string = ' a (1) ! '
result = pattern.parse_string(string).as_list()
answer = ['a', '(1)', '!']
assert result == answer
string = ' a (b) cd (e f) g hi (j (k l) m) (o p (qr (s t) u v) w (x y) z)'
result = pattern.parse_string(string).as_list()
answer = ['a', '(b)', 'cd', '(e f)', 'g', 'hi', '(j (k l) m)', '(o p (qr (s t) u v) w (x y) z)']
assert result == answer
* `pyparsing` can be installed with `pip install pyparsing`.
In addition, you can directly parse all the nested parentheses at once:
pattern = pp.ZeroOrMore(pp.Regex(r'\S+') ^ pp.nested_expr('(', ')'))
string = '1 2 3 (test 0, test 0) (test (0 test) 0)'
result = pattern.parse_string(string).as_list()
answer = ['1', '2', '3', ['test', '0,', 'test', '0'], ['test', ['0', 'test'], '0']]
assert result == answer
* Whitespace is a delimiter in this case.
Note:
If a pair of parentheses is broken (unbalanced) inside `()`, for example `a(b(c)` or `a(b)c)`, an unexpected result is obtained or an `IndexError` is raised, so use this with care. (See: Python extract string in a phrase)
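If an explicit error on unbalanced input is preferred, a similar nested-list shape can be produced with a small hand-written parser. This is a dependency-free sketch, not pyparsing's implementation: the function name `parse_nested` is hypothetical, whitespace and parentheses are assumed to be the only delimiters, and escapes are not handled.

```python
def parse_nested(text):
    """Return tokens as a nested list; each (...) group becomes a sub-list."""
    root = []
    stack = [root]   # stack[-1] is the list currently being filled
    token = []       # characters of the token being read

    def flush():
        # Store the pending token, if any, in the innermost open list.
        if token:
            stack[-1].append("".join(token))
            token.clear()

    for index, ch in enumerate(text):
        if ch == "(":
            flush()
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif ch == ")":
            flush()
            if len(stack) == 1:
                raise ValueError(f"unbalanced ')' at index {index}")
            stack.pop()
        elif ch.isspace():
            flush()
        else:
            token.append(ch)
    flush()
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return root


print(parse_nested("1 2 3 (test 0, test 0) (test (0 test) 0)"))
# ['1', '2', '3', ['test', '0,', 'test', '0'], ['test', ['0', 'test'], '0']]
```

Both `a(b(c)` and `a(b)c)` raise `ValueError` here, so broken input is reported instead of producing a surprising result.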

quasi-human