I am trying to create a regex that finds one or more occurrences of the following pattern:

some text + {text within braces} + {text within braces}

The trick is that the text within braces may include braces as well:

some text + {text withi{n} braces} + {tex{t} within {b}races}

I am able to match each of the three parts separately, but I cannot combine them into a single pattern that also handles the nested inner braces.

import re
import regex

v1_value="A"
v2_value="B"
v_string=rf'\\to{v1_value}or{v2_value}' # dynamically defining the value of the version string
print(f'v_string: {v_string}') # v_string: \\toAorB


match_outer_braces=r"\{(?:[^{}]*|(?R))*\}" # source: https://stackoverflow.com/a/63266732/7147695

whole_pattern=v_string+match_outer_braces*2 # combining the pattern (probably goes wrong here)

sentence1=r"Lorem \toAorB{versionA}{VersionB} ipsum" # sentence with no nested braces
sentence2=r"Lorem \toAorB{versionA}{Ver{s}ionB} ipsum" # sentence with braces within braces


extracted1=regex.findall(whole_pattern,sentence1)      # extracts the pattern as desired (no nested braces)
extracted2=regex.findall(match_outer_braces,sentence2) # extracts the outer braces
extracted3=regex.findall(whole_pattern,sentence2)      # does not manage to extract the whole pattern with nested braces

print(extracted1) # ['\\toAorB{versionA}{VersionB}']
print(extracted2) # ['{versionA}', '{Ver{s}ionB}']
print(extracted3) # []
Samuel Saari
  • Regular expressions cannot do that. It requires context-free grammar, Type-2 in [Chomsky hierarchy](https://en.wikipedia.org/wiki/Chomsky_hierarchy), and regular expressions are mostly limited to Type-3 – Marat Aug 02 '22 at 14:15
  • @marat, I probably just don't understand, but if `extracted2` above, using the new Python regex module, can recursively identify nested braces, then combining that with some text and another identical brace match (with \1, {2} or the like) should, in theory, be just a little tweak away. – Samuel Saari Aug 02 '22 at 16:45
  • I just learned about recursive expressions and find them fascinating. Please disregard my first comment – Marat Aug 02 '22 at 16:49

1 Answer

I would still like to see a regex solution, as this is very wordy. In any case, this elementary approach did the job for me. It does not exactly match the \toAorB{}{} pattern; rather, it extracts the contents of either the A or the B version and deletes the \toAorB{}{} construct, which I would have done after matching anyway.

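v_string=rf'\to{v1_value}or{v2_value}' # literal marker with a single backslash; the regex-escaped v_string from the question would never match endswith()
sentence=sentence1+". "+sentence2+"." # join the two test sentences from the question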


match_count=0
brace_count=0
new_sentence=""
match_helper=""
match_dictionary={}
match_active=False
v1_boolean=False

if v1_boolean:
    version_number=1
else:
    version_number=2

for letter in sentence:
    if not match_active:
        new_sentence += letter # start writing sentence
        if new_sentence.endswith(v_string):
            new_sentence=new_sentence.split(v_string)[0] # extract only until the \toV1orV2 starts
            match_active=True

    elif match_active:
        match_helper+=letter # start writing match text
        if letter=="{":
            brace_count +=1
        elif letter=="}":
            brace_count -=1
        if brace_count==0: # if outer brace closes, store value in dictionary
            match_count+=1
            match_dictionary[match_count]=match_helper
            match_helper=""
        if match_count==2: # when two matches are in the dictionary, write the chosen one to the sentence and continue
            new_sentence+=match_dictionary[version_number][1:-1] # removes the {} from beginning and end. NB! Not robust!
            match_dictionary={}
            match_count=0
            match_active=False


print('----')
print(sentence) # Lorem \toAorB{versionA}{VersionB} ipsum. Lorem \toAorB{versionA}{Ver{s}ionB} ipsum.
print(new_sentence) # Lorem VersionB ipsum. Lorem Ver{s}ionB ipsum.
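
For a regex-only variant, here is a minimal sketch using the standard-library re module. As far as I can tell, the combined pattern in the question fails because (?R) recurses into the whole pattern, so once the prefix is prepended, the nested braces would have to contain \toAorB{...}{...} again. Since re has no recursion at all, this sketch instead assumes the nesting never exceeds a fixed depth. The names nested_braces, pick_version and max_depth are hypothetical and introduced only for this illustration.

```python
import re


def nested_braces(max_depth: int = 3) -> str:
    """Build a pattern for one {...} group with at most max_depth nested levels."""
    pattern = r"\{[^{}]*\}"  # innermost level: no braces inside
    for _ in range(max_depth):
        # each pass wraps the previous pattern in one more level of braces
        pattern = r"\{(?:[^{}]|" + pattern + r")*\}"
    return pattern


def pick_version(text: str, macro: str = r"\toAorB",
                 version: int = 2, max_depth: int = 3) -> str:
    """Replace every macro{A}{B} construct with the content of A (1) or B (2)."""
    braces = nested_braces(max_depth)
    # the brace pattern only uses non-capturing groups, so groups 1 and 2
    # are exactly the two outer brace blocks
    pattern = re.compile(re.escape(macro) + f"({braces})({braces})")
    # a function replacement avoids backslash interpretation in the content
    return pattern.sub(lambda m: m.group(version)[1:-1], text)


sentence1 = r"Lorem \toAorB{versionA}{VersionB} ipsum"
sentence2 = r"Lorem \toAorB{versionA}{Ver{s}ionB} ipsum"

print(pick_version(sentence1))  # Lorem VersionB ipsum
print(pick_version(sentence2))  # Lorem Ver{s}ionB ipsum
```

Text nested deeper than max_depth is simply left untouched, so the limit has to be chosen generously for the input at hand. With the regex module, a recursion scoped to a single group rather than the whole pattern should remove the depth limit altogether.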
Samuel Saari