26

I'm trying to match a mathematical-expression-like string, that have nested parentheses.

import re

p = re.compile('\(.+\)')
str = '(((1+0)+1)+1)'
print p.findall(s)

['(((1+0)+1)+1)']

I wanted it to match all the enclosed expressions, such as (1+0), ((1+0)+1)...
I don't even care if it matches unwanted ones like (((1+0), I can take care of those.

Why it's not doing that already, and how can I do it?

Cinco
  • 275
  • 1
  • 3
  • 6
  • 1
    Possible duplicate of [Matching Nested Structures With Regular Expressions in Python](https://stackoverflow.com/questions/1099178/matching-nested-structures-with-regular-expressions-in-python) – 0 _ Sep 16 '17 at 11:38

12 Answers12

30

As others have mentioned, regular expressions are not the way to go for nested constructs. I'll give a basic example using pyparsing:

import pyparsing # make sure you have this installed

thecontent = pyparsing.Word(pyparsing.alphanums) | '+' | '-'
parens     = pyparsing.nestedExpr( '(', ')', content=thecontent)

Here's a usage example:

>>> parens.parseString("((a + b) + c)")

Output:

(                          # all of str
 [
  (                        # ((a + b) + c)
   [
    (                      #  (a + b)
     ['a', '+', 'b'], {}   
    ),                     #  (a + b)      [closed]
    '+',
    'c'
   ], {}
  )                        # ((a + b) + c) [closed]
 ], {}  
)                          # all of str    [closed]

(With newlining/indenting/comments done manually)

To get the output in nested list format:

res = parens.parseString("((12 + 2) + 3)")
res.asList()

Output:

[[['12', '+', '2'], '+', '3']]
Neuron
  • 5,141
  • 5
  • 38
  • 59
phooji
  • 10,086
  • 2
  • 38
  • 45
  • 2
    Pyparsing returns ParseResults objects which have the repr output you describe. This class also supports `asList()`, which will convert the output to straight nested lists. If you print them using `pprint.pprint`, you should get nice indented output. Also, the `nestedExpr` method is implicitly recursive, so you can define contents in this case as just `thecontent = (pyparsing.Word(pyparsing.alphanums) | '+' | '-')` - no Forward'ing needed. – PaulMcG Mar 28 '11 at 04:28
  • @Paul McGuire: Thanks for the clarification! As you can probably tell, I am a little green wrt pyparsing. – phooji Mar 28 '11 at 04:50
  • This will fail for expressions like `(a + b) + (c + (d + e))` (not having a common root), while this would work `((a + b) + (c + (d + e)))`. – norok2 Jul 26 '19 at 11:30
24

There is a new regular engine module being prepared to replace the existing one in Python. It introduces a lot of new functionality, including recursive calls.

import regex

s = 'aaa(((1+0)+1)+1)bbb'

result = regex.search(r'''
(?<rec> #capturing group rec
 \( #open parenthesis
 (?: #non-capturing group
  [^()]++ #anyting but parenthesis one or more times without backtracking
  | #or
   (?&rec) #recursive substitute of group rec
 )*
 \) #close parenthesis
)
''',s,flags=regex.VERBOSE)


print(result.captures('rec'))

Output:

['(1+0)', '((1+0)+1)', '(((1+0)+1)+1)']

Related bug in regex: http://code.google.com/p/mrab-regex-hg/issues/detail?id=78

ovgolovin
  • 13,063
  • 6
  • 47
  • 78
  • 3
    No... not another iteration of Perl's "irregular regular expressions"! – ivan_pozdeev May 30 '16 at 14:58
  • Since there's no other matching to be done, you can use `(?0)` to refer to itself.. for given sample, `regex.findall(r'\((?:[^()]++|(?0))++\)', s)` gives `['(((1+0)+1)+1)']` – Sundeep Nov 17 '19 at 09:25
14

Regex languages aren't powerful enough to matching arbitrarily nested constructs. For that you need a push-down automaton (i.e., a parser). There are several such tools available, such as PLY.

Python also provides a parser library for its own syntax, which might do what you need. The output is extremely detailed, however, and takes a while to wrap your head around. If you're interested in this angle, the following discussion tries to explain things as simply as possible.

>>> import parser, pprint
>>> pprint.pprint(parser.st2list(parser.expr('(((1+0)+1)+1)')))
[258,
 [327,
  [304,
   [305,
    [306,
     [307,
      [308,
       [310,
        [311,
         [312,
          [313,
           [314,
            [315,
             [316,
              [317,
               [318,
                [7, '('],
                [320,
                 [304,
                  [305,
                   [306,
                    [307,
                     [308,
                      [310,
                       [311,
                        [312,
                         [313,
                          [314,
                           [315,
                            [316,
                             [317,
                              [318,
                               [7, '('],
                               [320,
                                [304,
                                 [305,
                                  [306,
                                   [307,
                                    [308,
                                     [310,
                                      [311,
                                       [312,
                                        [313,
                                         [314,
                                          [315,
                                           [316,
                                            [317,
                                             [318,
                                              [7,
                                               '('],
                                              [320,
                                               [304,
                                                [305,
                                                 [306,
                                                  [307,
                                                   [308,
                                                    [310,
                                                     [311,
                                                      [312,
                                                       [313,
                                                        [314,
                                                         [315,
                                                          [316,
                                                           [317,
                                                            [318,
                                                             [2,
                                                              '1']]]]],
                                                         [14,
                                                          '+'],
                                                         [315,
                                                          [316,
                                                           [317,
                                                            [318,
                                                             [2,
                                                              '0']]]]]]]]]]]]]]]],
                                              [8,
                                               ')']]]]],
                                          [14,
                                           '+'],
                                          [315,
                                           [316,
                                            [317,
                                             [318,
                                              [2,
                                               '1']]]]]]]]]]]]]]]],
                               [8, ')']]]]],
                           [14, '+'],
                           [315,
                            [316,
                             [317,
                              [318, [2, '1']]]]]]]]]]]]]]]],
                [8, ')']]]]]]]]]]]]]]]],
 [4, ''],
 [0, '']]

You can ease the pain with this short function:

def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [ast[0]] + [shallow(a) for a in ast[1:]]

>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
[258,
 [318,
  '(',
  [314,
   [318, '(', [314, [318, '(', [314, '1', '+', '0'], ')'], '+', '1'], ')'],
   '+',
   '1'],
  ')'],
 '',
 '']

The numbers come from the Python modules symbol and token, which you can use to build a lookup table from numbers to names:

map = dict(token.tok_name.items() + symbol.sym_name.items())

You could even fold this mapping into the shallow() function so you can work with strings instead of numbers:

def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [map[ast[0]]] + [shallow(a) for a in ast[1:]]

>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
['eval_input',
 ['atom',
  '(',
  ['arith_expr',
   ['atom',
    '(',
    ['arith_expr',
     ['atom', '(', ['arith_expr', '1', '+', '0'], ')'],
     '+',
     '1'],
    ')'],
   '+',
   '1'],
  ')'],
 '',
 '']
Marcelo Cantos
  • 181,030
  • 38
  • 327
  • 365
  • I'd say PLY is better-suited for bigger parsing tasks. I've found it most useful when I really needed to mimic an existing Lex/Yacc parser. As for using python's built-in parser library... hmph. – phooji Mar 28 '11 at 04:06
  • @phooji: I agree. Python's parser library is *definitely* not for everyone. – Marcelo Cantos Mar 28 '11 at 05:47
  • not actually true, perl regex matching nested parens works fine: https://metacpan.org/pod/Regexp::Common python uses perl regexes – Erik Aronesty Jan 06 '21 at 21:41
12

The regular expression tries to match as much of the text as possible, thereby consuming all of your string. It doesn't look for additional matches of the regular expression on parts of that string. That's why you only get one answer.

The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parsing solutions. If you really just want to capture the pieces within parenthesis, just loop over the characters counting when you see ( and ) and increment a decrement a counter.

Winston Ewert
  • 44,070
  • 10
  • 68
  • 83
8

Stack is the best tool for the job: -

import re
def matches(line, opendelim='(', closedelim=')'):
    stack = []

    for m in re.finditer(r'[{}{}]'.format(opendelim, closedelim), line):
        pos = m.start()

        if line[pos-1] == '\\':
            # skip escape sequence
            continue

        c = line[pos]

        if c == opendelim:
            stack.append(pos+1)

        elif c == closedelim:
            if len(stack) > 0:
                prevpos = stack.pop()
                # print("matched", prevpos, pos, line[prevpos:pos])
                yield (prevpos, pos, len(stack))
            else:
                # error
                print("encountered extraneous closing quote at pos {}: '{}'".format(pos, line[pos:] ))
                pass

    if len(stack) > 0:
        for pos in stack:
            print("expecting closing quote to match open quote starting at: '{}'"
                  .format(line[pos-1:]))

In the client code, since the function is written as a generator function simply use the for loop pattern to unroll the matches: -

line = '(((1+0)+1)+1)'
for openpos, closepos, level in matches(line):
    print(line[openpos:closepos], level)

This test code produces following on my screen, noticed the second param in the printout indicates the depth of the parenthesis.

1+0 2
(1+0)+1 1
((1+0)+1)+1 0
Benny Khoo
  • 725
  • 6
  • 15
3

From a linked answer:

From the LilyPond convert-ly utility (and written/copyrighted by myself, so I can show it off here):

def paren_matcher (n):
    # poor man's matched paren scanning, gives up
    # after n+1 levels.  Matches any string with balanced
    # parens inside; add the outer parens yourself if needed.
    # Nongreedy.
    return r"[^()]*?(?:\("*n+r"[^()]*?"+r"\)[^()]*?)*?"*n

convert-ly tends to use this as paren_matcher (25) in its regular expressions which is likely overkill for most applications. But then it uses it for matching Scheme expressions.

Yes, it breaks down after the given limit, but the ability to just plug it into regular expressions still beats the "correct" alternatives supporting unlimited depth hands-down in usability.

  • How does this work? It returns just the first/outer matched paren, is that correct? So you would keep calling it on the results to go deeper? And maybe you need a check to see if there are any more ( to parse as it will return something when there are no parens. – nycynik Jan 07 '21 at 07:32
1

I believe this function may suit your need, I threw this together fast so feel free to clean it up a bit. When doing nests its easy to think of it backwards and work from there =]

def fn(string,endparens=False):
    exp = []
    idx = -1
    for char in string:
        if char == "(":
            idx += 1
            exp.append("")
        elif char == ")":
            idx -= 1
            if idx != -1:
                exp[idx] = "(" + exp[idx+1] + ")"
        else:
            exp[idx] += char
    if endparens:
        exp = ["("+val+")" for val in exp]
    return exp
  • 1
    i appreciate the sentiment, yes its lame to use a verbose function provide the list requested. forgive me if my answer doesn't match the utility of @phooji but i doubt even an improper parsing module is required here. though i've learned a whole ton of parsing from above examples with written descriptions. – Michael Gillette Mar 28 '11 at 04:07
  • 1
    @pynator: I'm confused by the semantic structure of your comment, but I think I agree :) – phooji Mar 28 '11 at 04:09
1

Balanced pairs (of parentheses, for example) is an example of a language that cannot be recognized by regular expressions.

What follows is a brief explanation of the math for why that is.

Regular expressions are a way of defining finite state automata (abbreviated FSM). Such a device has a finite amount of possible state to store information. How that state can be used is not particularly restricted, but it does mean that there are an absolute maximum number of distinct positions it can recognize.

For example, the state can be used for counting, say, unmatched left parentheses. But because the amount of state for that kind of counting must be completely bounded, then a given FSM can count to a maximum of n-1, where n is the number of states the FSM can be in. If n is, say, 10, then the maximum number of unmatched left parenthesis the FSM can match is 10, until it breaks. Since it's perfectly possible to have one more left parenthesis, there is no possible FSM that can correctly recognize the complete language of matched parentheses.

So what? Suppose you just pick a really large n? The problem is that as a way of describing FSM, regular expressions basically describe all of the transitions from one state to another. Since for any N, an FSM would need 2 state transitions (one for matching a left parenthesis, and one for matching right), the regular expression itself must grow by at least a constant factor multiple of n

By comparison, the next better class of languages, (context free grammars) can solve this problem in a totally compact way. Here's an example in BNF

expression ::= `(` expression `)` expression
           |    nothing
 
SingleNegationElimination
  • 151,563
  • 33
  • 264
  • 304
  • 2
    You are correct, but also wrong ;) Modern programming languages include regex facilities that can match stuff like `L = { ww | w \in {0,1}* }`, which is neither regular nor context-free. It is true that they cannot match arbitrarily nested matching parentheses, but not for the reason you mentioned. – phooji Mar 28 '11 at 04:15
  • 3
    Historically, regular expressions started out as a means to express FSM, but over the time, regular expressions gained much power so that most modern regular expression's expressive power goes beyond regular grammar. Therefore, discussions about FSM are not necessarily relevant to regular expressions. – sawa Mar 28 '11 at 04:17
  • We're talking at cross purposes. See http://montreal.pm.org/tech/neil_kandalgaonkar.shtml for an example of what I'm talking about. – phooji Mar 28 '11 at 04:58
  • You are wrong. There are languages that are proper subset of context free language while being unrecognizable by regular grammars: for example, Deterministic context-free languages, or Visibly pushdown languages. (More details: http://en.wikipedia.org/wiki/Deterministic_context-free_grammar, http://en.wikipedia.org/wiki/Nested_word) – sawa Mar 28 '11 at 05:53
0

Here is a demo for your question, though it is clumsy, while it works

import re s = '(((1+0)+1)+1)'

def getContectWithinBraces( x , *args , **kwargs):
    ptn = r'[%(left)s]([^%(left)s%(right)s]*)[%(right)s]' %kwargs
    Res = []
    res = re.findall(ptn , x)
    while res != []:
        Res = Res + res
        xx = x.replace('(%s)' %Res[-1] , '%s')
        res = re.findall(ptn, xx)
        print(res)
        if res != []:
            res[0] = res[0] %('(%s)' %Res[-1])
    return Res

getContectWithinBraces(s , left='\(\[\{' , right = '\)\]\}')
Chen
  • 117
  • 1
  • 1
  • 9
0

my solution is that: define a function to extract content within the outermost parentheses, and then you call that function repeatedly until you get the content within the innermost parentheses.

def get_string_inside_outermost_parentheses(text):
    content_p = re.compile(r"(?<=\().*(?=\))")
    r = content_p.search(text)
    return r.group() 

def get_string_inside_innermost_parentheses(text):
    while '(' in text:
        text = get_string_inside_outermost_parentheses(text)
    return text
Marti
  • 21
  • 3
  • 1
    This does what it says but can't handle non-nested parentheses. E.g. `get_string_inside_outermost_parentheses('a(b)c(d)e')` returns `'b)c(d'`. – Bill May 09 '23 at 00:17
0

You can use regexps, but you need to do the recursion yourself. Something like the following does the trick (if you only need to find, as your question says, all the expressions enclosed into parentheses):

import re

def scan(p, string):
    found = p.findall(string)
    for substring in found:
        stripped = substring[1:-1]
        found.extend(scan(p, stripped))
    return found

p = re.compile('\(.+\)')
string = '(((1+0)+1)+1)'
all_found = scan(p, string)
print all_found

This code, however, does not match the 'correct' parentheses. If you need to do that you will be better off with a specialized parser.

-2

Many posts suggest that for nested braces, REGEX IS NOT THE WAY TO DO IT. SIMPLY COUNT THE BRACES: For example, see: Regular expression to detect semi-colon terminated C++ for & while loops

Here is a complete python sample to iterate through a string and count braces:

# decided for nested braces to not use regex but brace-counting
import re, string

texta = r'''
nonexistent.\note{Richard Dawkins, \textit{Unweaving the Rainbow: Science, Delusion
and the Appetite for Wonder} (Boston: Houghton Mifflin Co., 1998), pp. 302, 304,
306-309.} more text and more.

 This is a statistical fact, not a
guess.\note{Zheng Wu, \textit{Cohabitation: An Alternative Form
of Family Living} (Ontario, Canada: Oxford University Press,
2000), p. 149; \hbox{Judith} Treas and Deirdre Giesen, ``Title
and another title,''
\textit{Journal of Marriage and the Family}, February 2000,
p.\,51}

more and more text.capitalize
'''
pos = 0
foundpos = 0
openBr = 0 # count open braces
while foundpos <> -1:
    openBr = 0
    foundpos = string.find(texta, r'\note',pos)
    # print 'foundpos',foundpos
    pos = foundpos + 5
    # print texta[pos]
    result = ""
    while foundpos > -1 and openBr >= 0:
        pos = pos + 1
        if texta[pos] == "{":
            openBr = openBr + 1
        if texta[pos] == "}":
            openBr = openBr - 1
        result = result + texta[pos]
    result = result[:-1] # drop the last } found.
    result = string.replace(result,'\n', ' ') # replace new line with space
    print result
Community
  • 1
  • 1