How do I split a string into groups when, and only when, the parentheses are balanced?

For example, how would I split "(Small Business (SB), Women-Owned Small Business (WOSB)), (8(a))" into ["(Small Business (SB), Women-Owned Small Business (WOSB))", "(8(a))"]?

spitfiredd

2 Answers

This is really hard (impossible?) to do with a regex, so it may be simpler to write a small loop that tracks the nesting depth, something like:

def split(s):
    start = 0
    nest = 0
    for i, char in enumerate(s):
        if char == "(":
            nest += 1
        elif char == ")":
            nest -= 1
        elif char == "," and nest == 0:
            yield s[start:i].strip()
            start = i + 1
    yield s[start:].strip()

s = "(Small Business (SB), Women-Owned Small Business (WOSB)), (8(a))"
list(split(s))
['(Small Business (SB), Women-Owned Small Business (WOSB))', '(8(a))']
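
The loop above assumes the input is already balanced; a stray ")" or an unclosed "(" just shifts the depth counter and the split quietly changes. Since the question asks for a split only when the parens are balanced, here is a minimal sketch of a stricter variant that rejects unbalanced input. The name split_balanced and the choice of ValueError are assumptions for illustration, not part of the answer above:

```python
def split_balanced(s):
    """Split on top-level commas; raise ValueError if parens are unbalanced."""
    start = 0
    nest = 0
    parts = []
    for i, char in enumerate(s):
        if char == "(":
            nest += 1
        elif char == ")":
            nest -= 1
            if nest < 0:
                # A ")" appeared with no "(" left to close.
                raise ValueError(f"unmatched ')' at index {i}")
        elif char == "," and nest == 0:
            parts.append(s[start:i].strip())
            start = i + 1
    if nest != 0:
        # The string ended while still inside one or more groups.
        raise ValueError(f"{nest} unclosed '('")
    parts.append(s[start:].strip())
    return parts


s = "(Small Business (SB), Women-Owned Small Business (WOSB)), (8(a))"
print(split_balanced(s))
# ['(Small Business (SB), Women-Owned Small Business (WOSB))', '(8(a))']
```

It returns a list rather than a generator so that no partial results are produced before an imbalance is detected.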
wim
  • You can't do this with regular expressions for arbitrary strings -- a regular expression isn't powerful enough to determine whether a string contains balanced parentheses. – BrokenBenchmark Mar 18 '22 at 02:49
  • @BrokenBenchmark According to [this](https://stackoverflow.com/q/546433/674039) it looks like it may be possible, but pretty difficult (and may require a more powerful regex engine than Python's stdlib one). – wim Mar 18 '22 at 03:48
  • That's largely because some languages have regex engines that allow features that regular expressions (as formally defined) [don't normally have](https://en.wikipedia.org/wiki/Regular_language). For example, the first regular expression in the question you've linked uses a depth counter. – BrokenBenchmark Mar 18 '22 at 03:54

Similar to wim's, but using itertools.groupby:

from itertools import groupby

def split(s):
    nest = 0
    def splitter(c):
        nonlocal nest
        if c == ',':
            return nest == 0
        if c == '(':
            nest += 1
        elif c == ')':
            nest -= 1
        return False
    return [''.join(g).strip()
            for k, g in groupby(s, splitter)
            if not k]

s = "(Small Business (SB), Women-Owned Small Business (WOSB)), (8(a))"
print(split(s))

Output:

['(Small Business (SB), Women-Owned Small Business (WOSB))', '(8(a))']
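
One difference between the loop version and the groupby version may matter for messier input: groupby merges adjacent top-level commas into a single separator group, so empty fields are dropped, while the loop keeps them. A small sketch comparing the two; the names split_loop and split_groupby are hypothetical, used only to tell the two functions apart:

```python
from itertools import groupby


def split_loop(s):
    # Loop-based version: yields one field per top-level comma, empty or not.
    start = 0
    nest = 0
    for i, char in enumerate(s):
        if char == "(":
            nest += 1
        elif char == ")":
            nest -= 1
        elif char == "," and nest == 0:
            yield s[start:i].strip()
            start = i + 1
    yield s[start:].strip()


def split_groupby(s):
    # groupby-based version: consecutive separator characters form one group.
    nest = 0

    def splitter(c):
        nonlocal nest
        if c == ",":
            return nest == 0
        if c == "(":
            nest += 1
        elif c == ")":
            nest -= 1
        return False

    return ["".join(g).strip() for k, g in groupby(s, splitter) if not k]


print(list(split_loop("(a),,(b)")))  # ['(a)', '', '(b)']
print(split_groupby("(a),,(b)"))     # ['(a)', '(b)']
```

The same effect shows up on an empty string: the loop version yields one empty field, while the groupby version returns an empty list.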
Kelly Bundy