regex: balancing "{}" in a complex regex (python)

Question

I am trying to extract information from a complex string with a regex. I want to capture everything between the first { and the last } as the content. Unfortunately, I am struggling with nested {}. How can I deal with this?

I think the key is to balance the {} across the whole regex, but I haven't been successful so far. See the example below. For the analogous problem with parentheses, see: Regular expression to match balanced parentheses

import re

my_string = """
extend mineral Uraninite {
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
    kinetics {
        rate = 3.2e-09 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
}
"""

regex = re.compile(
        r"extend\s+"
        r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
        r"(?P<species>[^\n ]+)\s+"
        r"{(?P<content>[^}]*)}\n\s+}")
extend_list = [m.groupdict() for m in regex.finditer(my_string)]

So far, I got:

print(extend_list[0]["content"])

"""
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
"""

Apparently, I need to use the third-party regex package, because re does not support recursion. Indeed, this seems to work:

import regex as re
pattern = re.compile(r"{(?P<content>((?:[^{}]|(?R))*))}")
extend_list2 = [m.groupdict() for m in pattern.finditer(my_string)]

print(extend_list2[0]["content"])

"""
kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
    kinetics {
        rate = 3.2e-09 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
"""

But inserting it in the main pattern does not work.

pattern = re.compile(
        r"extend\s+([^n]*)"
        r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
        r"(?P<species>[^\n ]+)\s+"
        r"{(?P<content>((?:[^{}]|(?R))*))\}")
extend_list = [m.groupdict() for m in pattern.finditer(my_string)]

The `{` and `}` need to be escaped with backslash, i.e. use `\{` — Tim Biegeleisen, Nov 20 '21 at 09:11
Also look at [the `X` (aka `VERBOSE`) flag](https://docs.python.org/3/library/re.html#re.X), which allows you to format the regex in a structured form, even to include comments. For complex expressions that's definitely a plus. — Tomalak, Nov 20 '21 at 09:29
Regexes can’t easily count nested bracketing except in limited cases - you’re better using a proper parser — DisappointedByUnaccountableMod, Nov 20 '21 at 09:55
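The two suggestions in the comments above (the VERBOSE flag and a hand-written parser) can be combined in a standard-library-only sketch. This is a different technique from a recursive regex: the regex matches only the header, and a depth counter finds the matching closing brace. The names HEADER and extract_extend_blocks and the shortened sample string are assumptions made for illustration, not code from the question.

```python
import re

# Header pattern from the question, laid out with re.VERBOSE so each part
# can carry a comment. It ends right after the opening brace of the block.
HEADER = re.compile(
    r"""
    extend\s+
    (?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?  # optional phase keyword
    (?P<species>\S+)\s+                                              # species name
    \{                                                               # opening brace of the block
    """,
    re.VERBOSE,
)


def extract_extend_blocks(text):
    """Return one dict per 'extend' block, with the text between its outermost braces."""
    blocks = []
    pos = 0
    while True:
        m = HEADER.search(text, pos)
        if m is None:
            break
        depth = 1  # the header has already consumed one "{"
        i = m.end()
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            raise ValueError("unbalanced braces after " + m.group("species"))
        # text[i - 1] is the closing brace that brought depth back to 0
        blocks.append({**m.groupdict(), "content": text[m.end():i - 1]})
        pos = i
    return blocks


# Shortened version of the string from the question (hypothetical sample).
sample = """
extend mineral Uraninite {
    kinetics {
        rate = -3.2e-08 mol/m2/s
        w-term {
            power = 0.37
        }
    }
}
"""

blocks = extract_extend_blocks(sample)
print(blocks[0]["phase"], blocks[0]["species"])
print(blocks[0]["content"])
```

Unlike the regex-only approaches, this sketch raises a ValueError when a block is never closed instead of silently returning a partial match.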

score 2 · Accepted Answer · answered Nov 20 '21 at 10:33

I believe the current regex can be written as

rx = r"extend\s+(.*)(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?(?P<species>\S+)\s+({(?P<content>((?:[^{}]++|(?4))*))})"

The (?R) construct recurses into the whole pattern, so once the brace matcher is embedded in a larger regex it has to become a subroutine call that recurses into the brace group only: ({(?P<content>((?:[^{}]++|(?4))*))}). That group is Group 4, so the subroutine call is (?4).

The [^n]* looks like a typo: it matches zero or more characters other than n. I used .* instead, which matches as many characters as possible other than line break characters.

The [^\n ] looks like an attempt to match chunks of non-whitespace characters, so I suggest \S here.
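The difference between these character classes can be checked with the standard re module. This is a minimal sketch; the header string and the variable names are made up for illustration, and the tab is included only to show where [^\n ] and \S differ.

```python
import re

# Hypothetical header line with a tab between the phase and the species.
header = "extend mineral\tUraninite {"

# [^n]* stops at the first literal "n", which is why it looks like a typo.
non_n = re.match(r"[^n]*", header).group()

# [^\n ]+ excludes only newline and space, so the tab is still consumed.
loose = re.search(r"[^\n ]+", "mineral\tUraninite").group()

# \S+ excludes every whitespace character, so it stops at the tab.
strict = re.search(r"\S+", "mineral\tUraninite").group()

print(repr(non_n))
print(repr(loose))
print(repr(strict))
```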

Actually, it can be simplified to `rx = r"extend\s+(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?(?P<species>[^\n ]+)\s+({(?P<content>((?:[^{}]++|(?3))*))})"`. — Antoine Collet, Nov 20 '21 at 14:19

quasi-human · Answer 2 · 2022-02-05T09:38:30.597

With pyparsing you can write simpler code to extract what lies between the first { and the last } in my_string:

import pyparsing as pp

pattern = pp.Regex(r'.*?(?={)') + pp.original_text_for(pp.nested_expr('{', '}'))
result = pattern.parse_string(my_string)[1][1:-1]
print(result)

* pyparsing can be installed with pip install pyparsing

Note:

If the braces inside {} are unbalanced (for example a{b{c} or a{b}c}), you get an unexpected result or an IndexError is raised, so be careful. (See: Python extract string in a phrase)
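One way to guard against this is to check that the braces are balanced before parsing. The helper below is a standard-library sketch, not part of pyparsing, and its name braces_balanced is an assumption chosen for illustration.

```python
def braces_balanced(text, open_ch="{", close_ch="}"):
    """Return True if every open_ch has a matching close_ch in the right order."""
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:  # a closing brace appeared before its opening brace
                return False
    return depth == 0


# The two broken inputs mentioned in the note, plus a well-formed one.
for candidate in ("a{b{c}}", "a{b{c}", "a{b}c}"):
    print(candidate, braces_balanced(candidate))
```

Calling it on my_string before parse_string would let the caller report a clear error instead of relying on the parser's behavior for malformed input.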
