I need to capture the text inside a \textbf{} command. The argument of \textbf can contain several levels of nested braces, as in the examples below:

\textbf{adadasas}

\textbf{adadasas \textit{xxx} adasda {xxx}}

\textbf{adadasas {} {} {} dxxxx}

I want to capture the value inside \textbf{...}.

I tried the following regex in Python: {([^{}]*+(?:(?R)[^{}]*)*+)} (from: Recursive pattern in regex)

x = regex.findall(r'\\textbf{([^{}]*+(?:(?R)[^{}]*)*+)}',cnt)

I am not getting all the values. When I remove \\textbf from the regex, it captures all the occurrences.

Please suggest how to write a regex for this case.

TeX_learner

1 Answer

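In your pattern, (?R) recurses the whole pattern, including the \\textbf prefix, so a nested {...} would only match if it were itself preceded by \textbf. Instead, you can repeat the first capture group with (?1), which covers only the braces, and capture what is inside the {} with group 2: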

\\textbf({([^{}]*+(?:(?1)[^{}]*)*+)})
  • \\textbf Match \textbf
  • ( Capture group 1
    • { Match a { char
    • ( Capture group 2
      • [^{}]*+ Match zero or more characters other than { and }, using a possessive quantifier
      • (?: Non-capturing group, repeated as a whole
        • (?1)[^{}]* Recurse into capture group 1, then match zero or more characters other than curly braces
      • )*+ Close the non-capturing group and repeat it zero or more times, using a possessive quantifier
    • ) Close group 2
    • } Match a } char
  • ) Close group 1

Note that if you use regex.findall, you will get the values of all capture groups returned for each match, and this pattern has 2 capture groups.
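As a small sketch of that behaviour (assuming the third-party regex module is installed; the sample string and the names pairs and inner are only illustrative), findall returns one tuple per match, and the second item of each tuple is the text between the outer braces:

```python
import regex  # third-party module (pip install regex); the stdlib re has no recursion

pattern = r"\\textbf({([^{}]*+(?:(?1)[^{}]*)*+)})"

# Illustrative input, not taken from the question
cnt = "\\textbf{a {b} c}\n\\textbf{d}"

# With two capture groups, findall yields a (group 1, group 2) tuple per match
pairs = regex.findall(pattern, cnt)

# Keep only group 2, the text between the outer braces
inner = [group2 for _, group2 in pairs]
print(inner)
```

This works, but unpacking tuples just to drop group 1 is a little awkward.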

You can use regex.finditer instead and get the group 2 value:

import regex

pattern = r"\\textbf({([^{}]*+(?:(?1)[^{}]*)*+)})"

cnt = ("\\textbf{adadasas}\n"
            "\\textbf{adadasas \\textit{xxx} adasda {xxx}}\n"
            "\\textbf{adadasas {} {} {} dxxxx}\n"
            "{adadasas {} {} {} dxxxx}")

matches = regex.finditer(pattern, cnt)

# group 2 holds the text between the outer braces
for match in matches:
    print(match.group(2))

Output

adadasas
adadasas \textit{xxx} adasda {xxx}
adadasas {} {} {} dxxxx
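If you want to cross-check the regex result without the third-party module, a plain brace-counting scan gives the same values for this input. This is a different technique from the recursive pattern above, shown only as a rough sketch: the function name textbf_contents is hypothetical, and it assumes braces are balanced and never escaped (it does not handle \{ or \}).

```python
def textbf_contents(text, command="\\textbf{"):
    """Return the brace-balanced argument of each occurrence of command."""
    results = []
    pos = text.find(command)
    while pos != -1:
        start = pos + len(command)  # first character after the opening brace
        depth = 1                   # one brace is already open
        i = start
        while i < len(text) and depth:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth:                   # ran out of text: unbalanced braces, stop
            break
        results.append(text[start:i - 1])  # drop the closing brace
        pos = text.find(command, i)        # continue after this match
    return results


cnt = ("\\textbf{adadasas}\n"
       "\\textbf{adadasas \\textit{xxx} adasda {xxx}}\n"
       "\\textbf{adadasas {} {} {} dxxxx}\n"
       "{adadasas {} {} {} dxxxx}")

for value in textbf_contents(cnt):
    print(value)
```

Like the regex, this skips the last line of the sample because it has no \textbf prefix.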
The fourth bird