Loop through a set of words and then use regex to remove the words from text

Question

I have a set of words (the set is dynamic, so I have to use a for loop):

a = {'i', 'the', 'at', 'it'}

And I have a text

text = 'i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope.'

Now I am trying to remove these words from the text, but somehow it's not working. Here is what I am using:

for word in a:
    text = re.sub(r'\bword\b', '', text).strip()

score 1 · Answer 1 · answered Feb 21 '23 at 03:14

Your regex is looking for the literal string "word". Use an f-string to interpolate the value stored in the variable named word:

text = re.sub(rf'\b{word}\b', '', text).strip()
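For completeness, here is a minimal runnable sketch of the whole loop with that fix applied. The function name `remove_words` and the `re.escape` call are additions for illustration, not part of the original answer; escaping only matters if a word could contain regex metacharacters.

```python
import re

def remove_words(text, words):
    """Remove whole-word occurrences of every entry in `words` from `text`."""
    for word in words:
        # rf'...' is both raw (so \b stays a word boundary) and an f-string
        # (so {…} is replaced by the current word). re.escape is an added
        # precaution so the word is matched literally.
        pattern = rf'\b{re.escape(word)}\b'
        text = re.sub(pattern, '', text).strip()
    return text

a = {'i', 'the', 'at', 'it'}
text = 'i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope.'
print(remove_words(text, a))
```

The words are removed, but the spaces that surrounded them stay, so the result contains doubled spaces.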

Selcuk

Ah neat. For some reason I didn't think you could combine `f''` and `r''` strings! – flakes Feb 21 '23 at 03:31

Digital Deception · Answer 2 · 2023-02-21T03:36:33.853

The reason this isn't working is that you are looking for the literal string "word". You're after:

text=re.sub(rf'\b{word}\b', '', text).strip()

This adds the actual value of word into the string.

When debugging a regex, it helps to log the match so you can check that it is doing what you expect.

import re

a = {'i', 'the', 'at', 'it'}
text = 'i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope.'

for word in a:
    print(f'Updating text, removing "{word}" from: "{text}"')
    # text = re.sub(r'\bword\b', '', text).strip()
    # Search with the original pattern to see what it actually matches
    print(re.search(r'\bword\b', text))

Updating text, removing "at" from: "i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope."

None

You can see that this is not finding a match, but if we simplify your expression:

print(re.search(word, text))

Updating text, removing "it" from: "i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope."

<re.Match object; span=(35, 37), match='it'>

This does find a match, which suggests that something is going wrong in how the pattern is being built.

regex101 is really useful for diagnosing such things. Simply print the actual regex out, and test it against the input:

print(r'\bword\b')
print(rf'\b{word}\b')

\bword\b

\bthe\b

You probably also want to tidy up the whitespace, which you can do like this:

text=re.sub(rf'\b{word}\s?\b', '', text).strip()

want to jump rope. will do tomorrow 5pm. love to jump rope.
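The `\s?` above consumes at most one whitespace character after each removed word. As an alternative sketch (not from this answer), the words can be joined into a single alternation and the leftover whitespace collapsed in a second step. The function name `remove_words_single_pass` is hypothetical, and `re.escape` is an added precaution.

```python
import re

def remove_words_single_pass(text, words):
    """Remove all `words` with one combined pattern, then collapse whitespace."""
    if not words:
        # An empty alternation would match at every word boundary; skip it.
        return text
    alternation = '|'.join(re.escape(w) for w in words)
    pattern = re.compile(rf'\b(?:{alternation})\b')
    without_words = pattern.sub('', text)
    # Collapse the runs of whitespace left where words were removed.
    return re.sub(r'\s+', ' ', without_words).strip()

a = {'i', 'the', 'at', 'it'}
text = 'i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope.'
print(remove_words_single_pass(text, a))
```

This still leaves a space in front of punctuation when the removed word was the last one in a sentence (for example `'do it.'` becomes `'do .'`), so further cleanup may be needed depending on the input.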

Miqueias Brikalski · Answer 3 · 2023-02-21T05:22:02.107

-1

Why import a library and not just use replace() instead?

list_words = {'i', 'the', 'at', 'it'}
text = 'i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope.'

for word in list_words:
    text = text.replace(word, "")

EDIT

This has a flaw, as pointed out by Selcuk in the comment below.
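A small sketch of that flaw, using a single word so the result does not depend on set iteration order. The helper name `remove_with_replace` is made up for this illustration.

```python
import re

def remove_with_replace(text, words):
    """Plain substring removal: has no notion of word boundaries."""
    for word in words:
        text = text.replace(word, '')
    return text

# str.replace also removes the 'i' inside 'will'.
print(repr(remove_with_replace('i will', {'i'})))

# The regex with \b only removes the standalone word 'i'.
print(repr(re.sub(r'\bi\b', '', 'i will')))
```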

edited Feb 21 '23 at 05:22

answered Feb 21 '23 at 03:22

Because it doesn't work. `\b` means word boundary, which your solution doesn't take into account. Have you tested this? – Selcuk Feb 21 '23 at 04:03
It removes all instances of the strings in `list_words` that occur in `text`, even in cases when it is not a word such as 'i' in will. I missed that, my bad. By the way, it was an honest question. I didn't think it would make sense to create a new question, but I am not yet allowed to publish comments either on other people's answers or on the original question. So I posted the question as an answer, also because it is a question other people may think of. What is the harm? – Miqueias Brikalski Feb 21 '23 at 05:16
Also, there is still another potential solution that uses only built-in functions: `text = ' '.join([word for word in text.split() if word not in list_words])` Though it might be too inefficient. What do you think? – Miqueias Brikalski Feb 21 '23 at 05:19
Note that word boundaries don't have to be whitespace only; they can also be punctuation. Your second attempt still doesn't work for cases like `"hello,world"`. Regex is the right tool here. – Selcuk Feb 21 '23 at 05:26
It works if the same separator is used between all words, which is the case for this question. – Miqueias Brikalski Feb 21 '23 at 05:44
Not really. It won't work if `rope` is in the stop word list. – Selcuk Feb 21 '23 at 11:10
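
To make the two objections in these comments concrete, here is a short comparison sketch. The helper names `remove_with_split` and `remove_with_regex` are invented for illustration; the split-based version is the one-liner from the comment above, and the regex version is one possible way to write the loop.

```python
import re

def remove_with_split(text, words):
    """Whitespace tokenisation: punctuation stays attached to the tokens."""
    return ' '.join(token for token in text.split() if token not in words)

def remove_with_regex(text, words):
    """Word-boundary removal, then whitespace normalisation."""
    for word in words:
        text = re.sub(rf'\b{re.escape(word)}\b', '', text)
    return ' '.join(text.split())

sentence = 'i love to jump the rope.'

# 'rope.' (with the period) is not equal to 'rope', so nothing is removed.
print(remove_with_split(sentence, {'rope'}))

# \b treats the period as a boundary, so 'rope' is removed.
print(remove_with_regex(sentence, {'rope'}))

# No whitespace separator at all: split() yields the single token 'hello,world'.
print(remove_with_split('hello,world', {'hello'}))
print(remove_with_regex('hello,world', {'hello'}))
```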
