3

I have several strings that I want to split on spaces, but only where the space is not inside parentheses.

For example

sentence = "blah (blah2 (blah3))|blah4 blah5"

should produce

["blah", "(blah2 (blah3))|blah4", "blah5"]

I've tried:

re.split(r"\s+(?=[^()]*(?:\(|$))", sentence)

but it produces:

['blah', '(blah2', '(blah3))|blah4', 'blah5']
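
A short sketch of why the lookahead falls short, using only the standard re module. The pattern only checks that the next parenthesis after the whitespace is an opening one (or that there is none), so it has no notion of nesting depth. The variable names below are illustrative.

```python
import re

sentence = "blah (blah2 (blah3))|blah4 blah5"
pattern = re.compile(r"\s+(?=[^()]*(?:\(|$))")

# For each whitespace run, show its position, the text that follows it,
# and whether the pattern accepts it as a split point.
for m in re.finditer(r"\s+", sentence):
    is_split = pattern.match(sentence, m.start()) is not None
    print(m.start(), repr(sentence[m.end():m.end() + 8]), is_split)

parts = pattern.split(sentence)
print(parts)
```

The space at index 11 sits inside the outer parentheses, but it is followed directly by "(", so the lookahead accepts it as a split point.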
Optimus
  • 9
    Regular expressions cannot handle matching (nested) parentheses. It is one of the canonical examples that requires a context-free grammar. – user2390182 Feb 06 '17 at 14:40
  • Can there be an opening or closing parenthesis without a matching closing/opening parenthesis? Should this situation produce an error? – Rick Feb 06 '17 at 14:40
  • There can be. It should produce an error. – Optimus Feb 06 '17 at 14:43
  • 1
    You really need to use a stack to track whether you're inside or outside brackets. Once you've eliminated the stuff inside brackets, you can then use a regex split. – ffledgling Feb 06 '17 at 14:45
  • @ffledgling I don't think a stack is really needed, or at least maybe it's not called that. How about an integer to indicate depth? – Moon Cheesez Feb 06 '17 at 14:47
  • @Optimus Do you mind if the spaces inside the brackets are removed? – Mohammad Yusuf Feb 06 '17 at 14:57

2 Answers

5

As said in the comments, Python's built-in re module cannot do this, because it has no way to match arbitrarily nested parentheses.

An alternative would be some good old string processing with a nesting count on parentheses:

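def parenthesis_split(sentence, separator=" ", lparen="(", rparen=")"):
    nb_brackets = 0
    sentence = sentence.strip(separator)  # get rid of leading/trailing seps

    l = [0]  # indexes where a separator occurs outside parentheses
    for i, c in enumerate(sentence):
        if c == lparen:
            nb_brackets += 1
        elif c == rparen:
            nb_brackets -= 1
        elif c == separator and nb_brackets == 0:
            l.append(i)
        # handle malformed string (closing parenthesis without an opening one)
        if nb_brackets < 0:
            raise ValueError("Syntax error: unmatched closing parenthesis")

    l.append(len(sentence))
    # handle missing closing parentheses
    if nb_brackets > 0:
        raise ValueError("Syntax error: missing closing parenthesis")

    # slice between consecutive indexes; drop the empty items that
    # consecutive separators would otherwise leave behind
    tokens = (sentence[i:j].strip(separator) for i, j in zip(l, l[1:]))
    return [t for t in tokens if t]

print(parenthesis_split("blah (blah2 (blah3))|blah4 blah5"))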

result:

['blah', '(blah2 (blah3))|blah4', 'blah5']

l contains the indexes in the string where a space not protected by parentheses occurs. At the end, the result list is generated by slicing the string between consecutive indexes.

Note the strip() at the end, which removes the separator left at the start of each slice, and the strip() at the start, which removes leading/trailing separators that would otherwise create empty items in the returned list.
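
The integer depth counter suggested in the comments can also be sketched as a single pass that accumulates characters instead of recording indexes. This is a minimal variant under stated assumptions, not the code above: split_outside_parens and its parameter names are hypothetical, and it assumes a single-character separator.

```python
def split_outside_parens(text, sep=" ", lparen="(", rparen=")"):
    """Split text on sep, ignoring separators nested inside parentheses."""
    tokens, current, depth = [], [], 0
    for ch in text:
        if ch == sep and depth == 0:
            if current:  # skip empty tokens from repeated separators
                tokens.append("".join(current))
                current = []
            continue
        if ch == lparen:
            depth += 1
        elif ch == rparen:
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched closing parenthesis")
        current.append(ch)
    if depth > 0:
        raise ValueError("missing closing parenthesis")
    if current:
        tokens.append("".join(current))
    return tokens

print(split_outside_parens("blah (blah2 (blah3))|blah4 blah5"))
print(split_outside_parens("  a   (b  c)  d "))

# Unbalanced input raises instead of returning a partial result.
for bad in ("a (b", "a) b"):
    try:
        split_outside_parens(bad)
    except ValueError as exc:
        print(repr(bad), "->", exc)
```

Separators inside parentheses are kept verbatim, while runs of separators outside them are collapsed.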

Jean-François Fabre
2

While it is true that the re module cannot handle recursion, the PyPI regex module can (to some extent). Just to show how advanced regex can work, here is a two-regex approach: the first validates that the parentheses are balanced and the second extracts the tokens:

>>> import regex
>>> sentence = "blah (blah2 (blah3))|blah4 blah5"
>>> reg_extract = regex.compile(r'(?:(\((?>[^()]+|(?1))*\))|\S)+')
>>> reg_validate = regex.compile(r'^[^()]*(\((?>[^()]+|(?1))*\)[^()]*)+$')
>>> res = []
>>> if reg_validate.fullmatch(sentence):
...     res = [x.group() for x in reg_extract.finditer(sentence)]
...
>>> print(res)
['blah', '(blah2 (blah3))|blah4', 'blah5']

Extraction regex details: matches 1 or more occurrences of

  • (\((?>[^()]+|(?1))*\)) - Capturing group 1 matching an opening (, then 0+ occurrences of either 1+ chars other than ( and ) (with [^()]+) or (|) a recursion of the whole capturing group 1 pattern (with (?1)), and then a closing )
  • | - or
  • \S - a non-whitespace char

Validation regex details:

  • ^ - start of string
  • [^()]* - 0+ chars other than ( and )
  • ( - Group 1 capturing 1 or more occurrences of:
    • \( - opening ( symbol
    • (?>[^()]+|(?1))* - 0+ occurrences of
      • [^()]+ - 1+ chars other than ( and )
      • | - or
      • (?1) - Subroutine call recursing Group 1 subpattern (recursion)
    • \) - a closing )
    • [^()]* - 0+ chars other than ( and )
  • )+ - end of Group 1
  • $ - end of string
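
For comparison, the property the validation regex checks (every parenthesis is matched and properly nested) can be sketched with a plain depth counter and no third-party module. is_balanced is a hypothetical helper, not part of regex or re. Unlike the validation pattern above, which as written needs at least one parenthesized group to match, this helper also accepts strings with no parentheses at all.

```python
def is_balanced(text, lparen="(", rparen=")"):
    """Return True if every parenthesis in text is matched and nested."""
    depth = 0
    for ch in text:
        if ch == lparen:
            depth += 1
        elif ch == rparen:
            depth -= 1
            if depth < 0:  # a closing parenthesis came before its opener
                return False
    return depth == 0  # anything left open means unbalanced

print(is_balanced("blah (blah2 (blah3))|blah4 blah5"))
print(is_balanced("blah (blah2 (blah3)|blah4"))
print(is_balanced("blah) (blah2"))
```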
Wiktor Stribiżew