Python regex replace substrings inside strings

Question

I have a string like this:

import re

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

And the expected output is:

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

Here's a small subset of what I've tried:

# None of these work, and typically only replace the first or last occurence of <
re.sub(r'(?<=CODE)<(?=CODE)', r'_LT_', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)<(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)[<]*(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL|re.MULTILINE)
re.sub(r'(CODE.*?)<(.*?CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(CODE.*)<(.*CODE)', r'\1_LT_\2', text, flags=re.DOTALL)

What I'd like to happen: All occurrences of < between CODE and CODE to be replaced with _LT_.

After spending the day on stackoverflow and regex101.com, I'm starting to think either it's not possible or I'm not smart enough to handle this.

Any help is tremendously appreciated!

Thanks in advance.

preferably the answer would provide a regex that can produce the expected output? — Aaron Meier, Jun 06 '21 at 16:34
A function or regex is not described by one input value (your string) and one output value (the expected output). Please provide a more general description of what you want. — René Pijl, Jun 06 '21 at 16:36
@RenéPijl are you saying that the question isn't clear? I'll update it. — Aaron Meier, Jun 06 '21 at 16:37
@RenéPijl updated the to include what I'm after: All occurrences of `<` between `CODE` and `CODE` to be replaced with `_LT_`. — Aaron Meier, Jun 06 '21 at 16:40
I posted an answer that should satisfy your needs, can you check it out and tell me if something is wrong please ? — S-c-r-a-t-c-h-y, Jun 06 '21 at 17:05

score 3 · Accepted Answer · answered Jun 06 '21 at 17:05

Here is my answer:

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

output = ''
for i in range(len(text.split('CODE'))):
    if i % 2:
        output += text.split('CODE')[i].replace('>', '_GT_').replace('<', '_LT_')
    else:
        output += text.split('CODE')[i]


print(output)

With this solution, every code block is being formated and added to the output. This does not include regex but this works.

That's a pretty good solution, thanks very much! I was hoping for a regex version, but it's looking like it's not possible. — Aaron Meier, Jun 06 '21 at 17:17

score 2 · Answer 2 · answered Jun 06 '21 at 19:59

With regex:

import re
text = "\nSome stuff to keep <b>here</b>\n\nCODE\n<b>Replace gt and lt</b>\n<i>inside <script>this</script> code</i>\nCODE\n\nSome more stuff to keep <b>here</b>\n"
pattern = r"(?s)CODE.*?CODE"
print(re.sub(pattern, lambda x: x.group().replace('<','_LT_').replace('>','_GT_'), text))

See Python proof.

Results:

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  (?s)                     set flags for this block (with . matching
                           \n) (case-sensitive) (with ^ and $
                           matching normally) (matching whitespace
                           and # normally)
--------------------------------------------------------------------------------
  CODE                     'CODE'
--------------------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
--------------------------------------------------------------------------------
  CODE                     'CODE'

Ivee Nara · Answer 3 · 2021-06-06T17:50:16.187

0

I'll update this answer in a few minutes with an only-regex solution but, meanwhile... Is not doing a split and then join strings a solution?

re.sub(regex, value, text.split("CODE\n")[1], flags)

EDIT! I found the answer, but it's a little bit hacky You can read the full description in this post: https://stackoverflow.com/a/11096811/8665327

Basically, the line you are looking for is this:

text = re.sub('\nCODE\n[^(CODE)]*\nCODE\n', lambda x: x.group(0).replace('<', '_LT_').replace('>', '_GT_'), text)

This will work with the first set of text placed between "CODE" text in its own line as long as there is no "CODE" string between them

will_work = """
<title>This will work</title>
CODE
<b>Replace this</b>
CODE
"""

wont_work = """
CODE
<b>This won't work</b>CODE
CODE
"""

edited Jun 06 '21 at 17:50

answered Jun 06 '21 at 16:44

Ivee Nara

34
3

Yeah looks like this will be the way to handle it I think. – Aaron Meier Jun 06 '21 at 17:17
Thanks by the way! If you can come up with a regex I'd love to see that. Seems like a hard problem though. – Aaron Meier Jun 06 '21 at 17:26
1

@AaronMeier I edited the comment :) The search 'python replace substring between two characters' on Google gave me that link https://stackoverflow.com/a/11096811/8665327 – Ivee Nara Jun 06 '21 at 17:52

Python regex replace substrings inside strings

3 Answers3