
I am trying to extract information from a complex string with a regex. I want to capture everything between the first { and the last } as the content. Unfortunately, I struggle with nested {}. How can I deal with this?

I think the key is to balance the {} across the whole regex, but I haven't been successful so far. See this example for parentheses: Regular expression to match balanced parentheses
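To make the balancing idea concrete, here is a minimal sketch that does it without any regex, by counting brace depth. The function name `first_braced_block` and the short `sample` string are made up for illustration; it assumes braces never appear inside quoted values or comments.

```python
def first_braced_block(text: str) -> str:
    """Return the text between the first '{' and its matching '}'."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no opening brace found")
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1          # entering a nested block
        elif ch == "}":
            depth -= 1          # leaving a block
            if depth == 0:      # back at the level of the first '{'
                return text[start + 1:i]
    raise ValueError("unbalanced braces")


# Hypothetical shortened input in the same shape as my_string
sample = "extend mineral X {\n    kinetics {\n        rate = 1\n    }\n}\n"
print(first_braced_block(sample))
```

The depth counter plays the role that recursion plays in a regex: the block ends only when every nested { has been closed.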

import re

my_string = """
extend mineral Uraninite {
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
    kinetics {
        rate = 3.2e-09 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
}
"""

regex = re.compile(
        r"extend\s+"
        r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
        r"(?P<species>[^\n ]+)\s+"
        r"{(?P<content>[^}]*)}\n\s+}")
extend_list = [m.groupdict() for m in regex.finditer(my_string)]

So far, I got:

print(extend_list[0]["content"])

"""
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
"""

Apparently, I need to use the third-party regex package, because re does not support recursion. Indeed, this seems to work:

import regex as re
pattern = re.compile(r"{(?P<content>((?:[^{}]|(?R))*))}")
extend_list2 = [m.groupdict() for m in pattern.finditer(my_string)]

print(extend_list2[0]["content"])

"""
kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
    kinetics {
        rate = 3.2e-09 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
"""

But inserting it into the main pattern does not work:

pattern = re.compile(
        r"extend\s+([^n]*)"
        r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
        r"(?P<species>[^\n ]+)\s+"
        r"{(?P<content>((?:[^{}]|(?R))*))\}")
extend_list = [m.groupdict() for m in pattern.finditer(my_string)]
Antoine Collet

2 Answers


I believe the current regex can be written as

rx = r"extend\s+(.*)(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?(?P<species>\S+)\s+({(?P<content>((?:[^{}]++|(?4))*))})"

The (?R) is changed into a regex subroutine call. (?R) recurses into the whole pattern, including the extend\s+ prefix, which cannot match inside the braces. Instead, the braced part is wrapped in its own capturing group, ({(?P<content>((?:[^{}]++|(?4))*))}). That group is Group 4, so the subroutine call is (?4).

The [^n]* looks like a typo: it matches zero or more characters other than the letter n. I used .*, which matches as many characters as possible other than line break characters.

The [^\n ] looks like an attempt to match non-whitespace chunks, so I suggest \S here.
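The differences between these character classes can be checked with the standard re module. This is only a small sketch; the `line` and `token` strings are made-up inputs, not taken from the question.

```python
import re

line = "extend mineral Uraninite {\n    kinetics {"

# [^n]* stops at the first literal "n", not at a newline.
print(repr(re.match(r"[^n]*", line).group()))     # 'exte'

# .* stops at the first line break (no re.DOTALL flag).
print(repr(re.match(r".*", line).group()))        # 'extend mineral Uraninite {'

# [^\n ]+ only excludes newline and space, so a tab is swallowed;
# \S+ stops at any whitespace character.
token = "Uraninite\t{"
print(repr(re.match(r"[^\n ]+", token).group()))  # 'Uraninite\t{'
print(repr(re.match(r"\S+", token).group()))      # 'Uraninite'
```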

Wiktor Stribiżew
  • Actually, it can be simplified to `rx = r"extend\s+(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?(?P<species>[^\n ]+)\s+({(?P<content>((?:[^{}]++|(?3))*))})"`. – Antoine Collet Nov 20 '21 at 14:19

You can write simpler code with pyparsing to extract what is between the first { and the last } in my_string:

import pyparsing as pp

pattern = pp.Regex(r'.*?(?={)') + pp.original_text_for(pp.nested_expr('{', '}'))
result = pattern.parse_string(my_string)[1][1:-1]
print(result)

* pyparsing can be installed with pip install pyparsing

Note:

If the braces are unbalanced inside {} (for example a{b{c}, a{b}c}, etc.), you get an unexpected result or an IndexError is raised, so be careful. (See: Python extract string in a phrase)
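One way to guard against that case is to check the input before parsing. The helper below is a hypothetical sketch using only the standard library, not part of pyparsing; it assumes braces never appear inside quoted values.

```python
def braces_balanced(text: str) -> bool:
    """Return True if every '{' has a matching '}' in the right order."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:       # a '}' appeared before its '{'
                return False
    return depth == 0           # leftover depth means an unclosed '{'


print(braces_balanced("a{b{c}}"))   # True
print(braces_balanced("a{b{c}"))    # False
print(braces_balanced("a{b}c}"))    # False
```

Calling this on the input first lets you raise a clear error instead of getting a truncated result.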

quasi-human