1

I have a string like this:

import re

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

And the expected output is:

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

Here's a small subset of what I've tried:

# None of these work, and typically only replace the first or last occurence of <
re.sub(r'(?<=CODE)<(?=CODE)', r'_LT_', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)<(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)[<]*(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL|re.MULTILINE)
re.sub(r'(CODE.*?)<(.*?CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(CODE.*)<(.*CODE)', r'\1_LT_\2', text, flags=re.DOTALL)

What I'd like to happen: All occurrences of < between CODE and CODE to be replaced with _LT_.

After spending the day on stackoverflow and regex101.com, I'm starting to think either it's not possible or I'm not smart enough to handle this.

Any help is tremendously appreciated!

Thanks in advance.

Aaron Meier
  • 929
  • 9
  • 21

3 Answers3

3

Here is my answer:

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

output = ''
for i in range(len(text.split('CODE'))):
    if i % 2:
        output += text.split('CODE')[i].replace('>', '_GT_').replace('<', '_LT_')
    else:
        output += text.split('CODE')[i]


print(output)

With this solution, every code block is being formated and added to the output. This does not include regex but this works.

S-c-r-a-t-c-h-y
  • 321
  • 2
  • 5
  • That's a pretty good solution, thanks very much! I was hoping for a regex version, but it's looking like it's not possible. – Aaron Meier Jun 06 '21 at 17:17
2

With regex:

import re
text = "\nSome stuff to keep <b>here</b>\n\nCODE\n<b>Replace gt and lt</b>\n<i>inside <script>this</script> code</i>\nCODE\n\nSome more stuff to keep <b>here</b>\n"
pattern = r"(?s)CODE.*?CODE"
print(re.sub(pattern, lambda x: x.group().replace('<','_LT_').replace('>','_GT_'), text))

See Python proof.

Results:

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  (?s)                     set flags for this block (with . matching
                           \n) (case-sensitive) (with ^ and $
                           matching normally) (matching whitespace
                           and # normally)
--------------------------------------------------------------------------------
  CODE                     'CODE'
--------------------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
--------------------------------------------------------------------------------
  CODE                     'CODE'
Ryszard Czech
  • 18,032
  • 4
  • 24
  • 37
0

I'll update this answer in a few minutes with an only-regex solution but, meanwhile... Is not doing a split and then join strings a solution?

re.sub(regex, value, text.split("CODE\n")[1], flags)

EDIT! I found the answer, but it's a little bit hacky You can read the full description in this post: https://stackoverflow.com/a/11096811/8665327

Basically, the line you are looking for is this:

text = re.sub('\nCODE\n[^(CODE)]*\nCODE\n', lambda x: x.group(0).replace('<', '_LT_').replace('>', '_GT_'), text)

This will work with the first set of text placed between "CODE" text in its own line as long as there is no "CODE" string between them

will_work = """
<title>This will work</title>
CODE
<b>Replace this</b>
CODE
"""

wont_work = """
CODE
<b>This won't work</b>CODE
CODE
"""

Ivee Nara
  • 34
  • 3
  • Yeah looks like this will be the way to handle it I think. – Aaron Meier Jun 06 '21 at 17:17
  • Thanks by the way! If you can come up with a regex I'd love to see that. Seems like a hard problem though. – Aaron Meier Jun 06 '21 at 17:26
  • 1
    @AaronMeier I edited the comment :) The search 'python replace substring between two characters' on Google gave me that link https://stackoverflow.com/a/11096811/8665327 – Ivee Nara Jun 06 '21 at 17:52