Removing phrases with re.sub with different letter cases in python

Question

I'm trying to remove some phrases from a string regardless of letter case. Some of the words/phrases end with "&"; for those, everything that follows the phrase should be removed as well.

import re

txt = "People have a different titles, like Dr. Mr. Miss and so on."
removed = "like", "Different titles,", "a", "Miss and&"
char = "&"


removed = list(sorted(removed))
for i in range(len(removed)):
    removed[i] = removed[i].upper()

txt = re.sub(fr"\s*(?<!\S)(?:{'|'.join(map(re.escape, removed))})(?!\S)", "", txt.upper()).strip()

This gives me "PEOPLE HAVE DR. MR. MISS AND SO ON.". Is there a way to get "PEOPLE HAVE DR. MR." with regex? If not, how else can I remove phrases from a string? So far I have tried the following, to no avail.

txt = "People have a different titles, like Dr. Mr. Miss and so on."
remove = "like", "a", "&Miss and", "&so"
ch = "&"
txt = txt.split()
new = []

for i in txt:
    if ch+i in remove:
        break
    else:
        if i not in remove:
            new.append(i)
result = " ".join(new)

I get the result "People have different titles, Dr. Mr. Miss and", whereas I would like to get "People have different titles, Dr. Mr."

You might try as a start `"People have a different titles, like Dr. Mr. Miss and so on.".rsplit(" Miss and")[0]` maybe — JonSG, Jan 20 '22 at 13:38
Try `txt = re.sub(fr"\s*\b(?:{'|'.join(map(re.escape, removed))})\b".replace('\\' + char, ".*"), "", txt.upper()).strip()`. also, check https://ideone.com/ZgdMbb — Wiktor Stribiżew, Jan 20 '22 at 13:56
Hey Wiktor. Unfortunately, this solution does not remove "Different titles," and also for some reason adds an extra dot to the end. Although the removal of the phrase with the character worked which is good to see. — thomasjohnson77, Jan 20 '22 at 16:14
`txt = re.sub(fr"\s*\b(?:{'|'.join(map(re.escape, removed))})".replace('\\' + char, ".*"), "", txt, flags=re.I).strip().upper()` worked. Again thanks a lot! @WiktorStribiżew — thomasjohnson77, Jan 20 '22 at 17:02
I have a better solution that I posted below. Please check and let me know if there is anything unclear. — Wiktor Stribiżew, Jan 21 '22 at 08:46

score 0 · Accepted Answer · answered Jan 20 '22 at 21:20

You can use

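import re

txt = "People have a different titles, like Dr. Mr. Miss and so on."
removed = "like", "Different titles,", "a", "Miss and&"
char = "&"  # trailing marker: also remove everything after the phrase
removed = sorted(removed)
# Escape each phrase, join into one alternation, add an adaptive right-hand boundary.
p = fr"\s*\b(?:{'|'.join(map(re.escape, removed))})(?:(?<=\w)\b|(?<!\w))"
# re.escape(char) is how the marker appears inside p; swap it for ".*".
txt = re.sub(p.replace(re.escape(char), ".*"), "", txt, flags=re.I).strip().upper()
print(txt)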

See the Python demo. Output:

PEOPLE HAVE DR. MR.

The p pattern will look like \s*\b(?:Different\ titles,|Miss\ and\&|a|like)(?:(?<=\w)\b|(?<!\w)), where \s* matches zero or more whitespace characters, \b is a word boundary, the (?:...) group holds the alternatives you passed as the removed list, and (?:(?<=\w)\b|(?<!\w)) is an adaptive right-hand word boundary: it requires a word boundary on the right only if the phrase ends with a word char. More details in my "Dynamic adaptive word boundaries" YT video.
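
To illustrate why the right-hand boundary is adaptive, here is a minimal sketch comparing it with a plain trailing \b. The shortened text and the variable names are only for illustration. A plain \b cannot match between "," and a space, which is why a phrase ending in a comma is skipped by the simpler pattern.

```python
import re

phrases = ["Different titles,", "a"]
alternation = "|".join(map(re.escape, phrases))
text = "People have a different titles, like Dr."

# Plain trailing \b: fails after "," because "," and " " are both non-word chars.
plain = rf"\s*\b(?:{alternation})\b"
# Adaptive trailing boundary: \b is required only after a word char.
adaptive = rf"\s*\b(?:{alternation})(?:(?<=\w)\b|(?<!\w))"

plain_result = re.sub(plain, "", text, flags=re.I)
adaptive_result = re.sub(adaptive, "", text, flags=re.I)

print(plain_result)     # People have different titles, like Dr.
print(adaptive_result)  # People have like Dr.
```

In both cases the standalone "a" is removed while the "a" inside "have" is kept, because the boundary is still enforced after a phrase that ends with a word char.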

The .replace(..., ".*") call "converts" the escaped & marker to .* so that it matches any text (zero or more chars other than line break chars, as many as possible).
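
As a rough sketch of that conversion step in isolation (the shortened phrase list and text are assumptions made for the example), each phrase is escaped first and the escaped marker is then swapped for .*:

```python
import re

char = "&"
removed = ["like", "Miss and&"]

# re.escape backslash-escapes the space and the "&" marker.
escaped = [re.escape(item) for item in removed]
# Replace the escaped marker with ".*" so the phrase swallows the rest of the line.
converted = [item.replace(re.escape(char), ".*") for item in escaped]
print(converted)

text = "Dr. Mr. Miss and so on."
pattern = rf"\s*\b(?:{'|'.join(converted)})"
result = re.sub(pattern, "", text, flags=re.I)
print(result)  # Dr. Mr.
```

Since . does not match line breaks by default, the removal stops at the end of the current line; passing re.S in the flags would let it run across lines.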

The flags=re.I option makes matching case insensitive, so there is no need to call lower() or upper() on the removed words.

This worked great, thanks. But I'm facing a problem with matching when the connected phrase is a special character. `txt = "People ? have a different titles, like Dr. ! Mr. Miss and so on."` and `removed = "like", "Different titles,", "!&", "#&"` would result in `People # have a Dr.` But only "!", not all non-alphanumeric characters. Is it possible with 2 non-word characters? @WiktorStribiżew — thomasjohnson77, Jan 25 '22 at 16:36
@thomasjohnson77 Use adaptive word boundaries. `?!` and suchlike are not words, so no need to use `\b` at all here. See [this Python demo](https://ideone.com/cwYo1P). — Wiktor Stribiżew, Jan 25 '22 at 16:48
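
The linked demo is not reproduced here. As a sketch of what boundaries that adapt on both sides might look like, the code below assumes the left-hand check mirrors the right-hand one from the answer; `remove_phrases` is a hypothetical helper name, not something taken from the demo.

```python
import re

def remove_phrases(text, removed, char="&"):
    # Hypothetical helper: boundaries are enforced only next to word chars.
    left = r"(?:\b(?=\w)|(?!\w))"     # \b only if the phrase starts with a word char
    right = r"(?:(?<=\w)\b|(?<!\w))"  # \b only if the phrase ends with a word char
    alternation = "|".join(map(re.escape, removed))
    # Turn the escaped marker into ".*" to drop the rest of the line.
    alternation = alternation.replace(re.escape(char), ".*")
    pattern = rf"\s*{left}(?:{alternation}){right}"
    return re.sub(pattern, "", text, flags=re.I).strip()

txt = "People ? have a different titles, like Dr. ! Mr. Miss and so on."
removed = ["like", "Different titles,", "!&", "#&"]
result = remove_phrases(txt, removed)
print(result)  # People ? have a Dr.
```

With a fixed leading \b, the "!&" entry would never match here, because there is no word boundary between a space and "!".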
