
I'd like to remove all paragraphs that start with any string in a list (case-insensitive): ["keyword", "disclosure"]

My code:

re.sub("(?i)\n(keyword|disclosure).*(\n|$)", "\n", txt)

This works fine if there is at least one paragraph between the bad paragraphs, but it does not work if there is more than one bad paragraph in a row.
For example:

Text text text
Keywords: text text, text. Texts
Disclosures of stuff text more texts
Stuff text text

Results in the subsequent bad paragraphs getting missed:

Text text text
Disclosures of stuff text more texts
Stuff text text

Instead of what I would like to see:

Text text text
Stuff text text

How can I ensure all repeated matches are also replaced? Preferably I'd also like repeated matches treated as the same match so I don't get extra newlines, but if it's much cleaner and easier to just replace repeated newlines with a newline after, that's ok.

wjandrea
Pickle
    It won't find overlapping matches, and the `\n` at the end of the first match is also the `\n` at the beginning of the next match. Use lookarounds so the newlines aren't included in the matches. Or use `^` and `$` along with the `re.MULTILINE` flag. – Barmar Sep 01 '23 at 16:49
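
To illustrate the comment above, here is a minimal sketch comparing the original pattern with a lookbehind variant. The sample text mirrors the question; the variable names `original` and `lookaround` are only illustrative. It assumes the unwanted paragraph is never the very first line, since `(?<=\n)` needs a preceding newline (the `^` with `re.MULTILINE` approach does not have that limitation).

```python
import re

txt = (
    "Text text text\n"
    "Keywords: text text, text. Texts\n"
    "Disclosures of stuff text more texts\n"
    "Stuff text text"
)

# Original pattern: each match consumes the trailing "\n", so the next
# unwanted line has no leading "\n" left to match and is skipped.
original = re.sub(r"(?i)\n(keyword|disclosure).*(\n|$)", "\n", txt)

# Lookbehind variant: assert that a newline precedes the line without
# consuming it, so consecutive unwanted lines each match on their own.
lookaround = re.sub(r"(?i)(?<=\n)(keyword|disclosure).*(\n|$)", "", txt)

print(original)
print("---")
print(lookaround)
```

The first result still contains the "Disclosures" line, while the second contains only the two wanted lines.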

3 Answers

import re

txt = """
Text text text
Keywords: text text, text. Texts
Disclosures of stuff text more texts
Stuff text text
"""

keywords = ["keyword", "disclosure"]

pattern = "(?i)^(" + "|".join(keywords) + ").*?(\n|$)"
result = re.sub(pattern, "", txt, flags=re.MULTILINE)

print(result)

Output

Text text text
Stuff text text
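
As a possible hardening of the approach above (not part of the original answer), the sketch below escapes each keyword with `re.escape` before joining, so that a prefix containing regex metacharacters such as `.` is treated literally. The helper name `remove_paragraphs` and the guard for an empty list are assumptions made for illustration.

```python
import re

def remove_paragraphs(text, prefixes):
    """Drop every line that starts with one of `prefixes` (case-insensitive)."""
    if not prefixes:
        # An empty alternation would match every line, so return early.
        return text
    # re.escape keeps characters such as "." or "(" in a prefix literal.
    alternation = "|".join(re.escape(p) for p in prefixes)
    pattern = re.compile(
        rf"^(?:{alternation}).*(?:\n|$)",
        re.IGNORECASE | re.MULTILINE,
    )
    return pattern.sub("", text)

txt = "Text text text\nKeywords: a, b\nDisclosures of stuff\nStuff text text"
print(remove_paragraphs(txt, ["keyword", "disclosure"]))
```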
Sauron

The '\n' at the end of one match is also the '\n' at the start of the next, and regex matches cannot overlap, so consecutive matches are missed. Instead, use ^ to match the start of each line, which also covers the case where the string starts with a paragraph you want to remove. This requires the MULTILINE flag. The replacement then becomes the empty string.

import re

txt = '''\
Text text text
Keywords: text text, text. Texts
Disclosures of stuff text more texts
Stuff text text'''

result = re.sub("(?im)^(keyword|disclosure).*(\n|$)", "", txt)
print(result)

Output:

Text text text
Stuff text text
wjandrea

Use re.MULTILINE (i.e. (?m)) and ^/$ anchors:

import re


txt = """
Text text text
Keywords: text text, text. Texts
Disclosures of stuff text more texts
Stuff text text
"""

print(re.sub("(?mi)^(keyword|disclosure).*$", "", txt))

prints out

Text text text


Stuff text text

and you can then clean out multiple newlines (since this will have replaced the entire line with an empty string) if you need to.
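
One way to do that cleanup is sketched below, under the assumption that the text has no intentional blank lines worth keeping, because this collapses every run of newlines, not only the ones left by the removal. The variable names `blanked` and `collapsed` are illustrative.

```python
import re

txt = "\nText text text\nKeywords: a, b\nDisclosures of stuff\nStuff text text\n"

# Step 1: blank out the unwanted lines (their newlines stay in place).
blanked = re.sub(r"(?mi)^(keyword|disclosure).*$", "", txt)

# Step 2: collapse each run of two or more newlines into a single newline.
collapsed = re.sub(r"\n{2,}", "\n", blanked)

print(repr(blanked))
print(repr(collapsed))
```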

AKX
  • Thanks! This got me close but the solution @wjandrea posted using `(\n|$)` instead of `$` also conveniently solved the issue of extra newlines :3 – Pickle Sep 01 '23 at 17:58