I am trying to extract information from a complex string with a regex. I want to capture everything between the first { and the last } as the content. Unfortunately, I am struggling with the nested {}. How can I deal with this?
I think the key is to balance the {} across the whole regex, but I haven't been successful so far. See the example below for parentheses:
Regular expression to match balanced parentheses
import re
my_string = """
extend mineral Uraninite {
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
    kinetics {
        rate = 3.2e-09 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
}
"""
regex = re.compile(
    r"extend\s+"
    r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
    r"(?P<species>[^\n ]+)\s+"
    r"{(?P<content>[^}]*)}\n\s+}")
extend_list = [m.groupdict() for m in regex.finditer(my_string)]
So far, I got:
print(extend_list[0]["content"])
"""
kinetics {
rate = -3.2e-08 mol/m2/s
area = Uraninite
y-term, species = Uraninite
w-term {
species = H[+]
power = 0.37
"""
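The truncated result above is what one would expect from `[^}]*`: that character class cannot cross a `}`, so the captured content has to end at the first closing brace it meets. A minimal sketch on a made-up one-line string (the string and variable names are only for illustration):

```python
import re

# Made-up input with one level of nesting.
text = "outer { a { b } c }"

# Same idea as the content group above: "anything that is not a closing brace".
m = re.search(r"\{(?P<content>[^}]*)\}", text)

# The capture stops at the first "}", so the nested block is cut short.
print(repr(m.group("content")))  # ' a { b '
```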
Apparently, I need to use the third-party regex package, because re does not support recursion. Indeed, this seems to work:
import regex as re
pattern = re.compile(r"{(?P<content>((?:[^{}]|(?R))*))}")
extend_list2 = [m.groupdict() for m in pattern.finditer(my_string)]
print(extend_list2[0]["content"])
"""
kinetics {
rate = -3.2e-08 mol/m2/s
area = Uraninite
y-term, species = Uraninite
w-term {
species = H[+]
power = 0.37
}
}
kinetics {
rate = 3.2e-09 mol/m2/s
area = Uraninite
y-term, species = Uraninite
w-term {
species = H[+]
power = 0.37
}
}
"""
But inserting it in the main pattern does not work.
pattern = re.compile(
    r"extend\s+([^n]*)"
    r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
    r"(?P<species>[^\n ]+)\s+"
    r"{(?P<content>((?:[^{}]|(?R))*))\}")
extend_list = [m.groupdict() for m in pattern.finditer(my_string)]
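A likely reason for the failure: (?R) recurses into the entire pattern, so once the `extend ...` prefix is part of it, every nested { would have to be preceded by another `extend` header. The regex package also allows recursing into a single group, with (?1) or (?&name), which may be the more direct fix. As an alternative that needs only the standard library, here is a minimal sketch that swaps recursion for a plain depth counter: re matches just the header, and a loop finds the balancing brace. The names HEADER and extract_extend_blocks are hypothetical, and the sample is a shortened version of the string above.

```python
import re

# Matches only the header, up to and including the opening brace.
HEADER = re.compile(
    r"extend\s+"
    r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
    r"(?P<species>[^\n ]+)\s+\{"
)

def extract_extend_blocks(text):
    """Hypothetical helper: return one dict per balanced `extend` block."""
    blocks = []
    pos = 0
    while True:
        m = HEADER.search(text, pos)
        if m is None:
            break
        depth = 1          # the header already consumed one "{"
        i = m.end()
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break          # unbalanced braces: stop rather than guess
        info = m.groupdict()
        info["content"] = text[m.end():i - 1]  # exclude the closing "}"
        blocks.append(info)
        pos = i
    return blocks

# Shortened sample, assumed to have the same shape as the real input.
sample = """
extend mineral Uraninite {
    kinetics {
        rate = -3.2e-08 mol/m2/s
        w-term {
            species = H[+]
        }
    }
    kinetics {
        rate = 3.2e-09 mol/m2/s
    }
}
"""

blocks = extract_extend_blocks(sample)
print(blocks[0]["phase"], blocks[0]["species"])    # mineral Uraninite
print(blocks[0]["content"].count("kinetics"))      # 2
```

The counter treats every brace as structural, so it assumes braces never appear inside values or comments in the input.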