I have a set of words (the set is dynamic, so I have to use a for loop):

a = {'i', 'the', 'at', 'it'}

And I have a text

text = 'i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope.'

Now I am trying to remove these words from the text, but somehow it's not working. Here is what I am using:

for word in a:
    text = re.sub(r'\bword\b', '', text).strip()
Selcuk
jangu

3 Answers

Your regex is looking for the literal string "word". Use an f-string to interpolate the value stored in the variable named word:

text = re.sub(rf'\b{word}\b', '', text).strip()
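
For completeness, here is a minimal sketch of the whole loop with that fix, reusing the `a` and `text` from the question. The `re.escape` call is an addition that is not in the answer above; it is only a precaution in case a word in the set ever contains regex metacharacters such as `.` or `+`.

```python
import re

a = {'i', 'the', 'at', 'it'}
text = 'i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope.'

for word in a:
    # The rf-string interpolates the current word between the two \b anchors.
    # re.escape is an added precaution (assumption: words may contain metacharacters).
    text = re.sub(rf'\b{re.escape(word)}\b', '', text).strip()

print(text)
```

Note that this leaves the surrounding spaces in place, so the result contains double spaces where a word was removed.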
Selcuk

The reason this isn't working is that you are looking for the literal string "word". You're after:

text=re.sub(rf'\b{word}\b', '', text).strip()

This adds the actual value of word into the string.

When debugging a regex, it helps to log the match so you can check that it is doing what you expect.

import re

a = {'i', 'the', 'at', 'it'}
text = 'i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope.'

for word in a:
    print(f'Updating text, removing "{word}" from: "{text}"')
    # text = re.sub(r'\bword\b', '', text).strip()
    print(re.search(r'\bword\b', text))

Updating text, removing "at" from: "i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope."

None

You can see that this is not finding a match, but if we simplify your expression:

print(re.search(word, text))

Updating text, removing "it" from: "i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope."

<re.Match object; span=(35, 37), match='it'>

This does find a match, which suggests that something is going wrong when the word is turned into a regex.

regex101 is really useful for diagnosing such things. Simply print the actual regex out, and test it against the input:

print(r'\bword\b')
print(rf'\b{word}\b')

\bword\b

\bthe\b


You probably also want to tidy up the whitespace, which you can do like this:

text=re.sub(rf'\b{word}\s?\b', '', text).strip()

want to jump rope. will do tomorrow 5pm. love to jump rope.
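
As an alternative to looping, the same removal can be sketched as a single pass over the text with one alternation pattern. This is a variant, not what the answer above uses: the names `pattern` and `result` are illustrative, `re.escape` is an added precaution, and the optional whitespace is placed after the closing `\b` rather than before it.

```python
import re

a = {'i', 'the', 'at', 'it'}
text = 'i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope.'

# Join the (escaped) words into one alternation, e.g. \b(?:the|at|it|i)\b\s?
# Sorting longest-first is not required with \b anchors; it just makes the
# alternation order easier to reason about.
words = sorted(a, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b\s?')

# One substitution removes every listed word plus one trailing whitespace character.
result = pattern.sub('', text).strip()
print(result)
# want to jump rope. will do tomorrow 5pm. love to jump rope.
```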

Digital Deception

Why import a library and not just use replace() instead?

list_words = {'i', 'the', 'at', 'it'}
text = 'i want to jump the rope. i will do it tomorrow at 5pm. i love to jump the rope.'

for word in list_words:
    text = text.replace(word, "")

EDIT

This has a flaw, as pointed out by Selcuk in the comments below.
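
To make the flaw concrete, here is a small sketch comparing `str.replace` with a word-boundary regex on a shortened input. The sample string and the variable names `replaced` and `regexed` are made up for illustration.

```python
import re

sample = 'i will do it'

# str.replace removes every occurrence of the substring, even inside other words.
replaced = sample.replace('i', '')
print(repr(replaced))   # ' wll do t'

# \b restricts the match to "i" as a standalone word.
regexed = re.sub(r'\bi\b', '', sample)
print(repr(regexed))    # ' will do it'
```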

  • Because it doesn't work. `\b` means word boundary, which your solution doesn't take into account. Have you tested this? – Selcuk Feb 21 '23 at 04:03
  • It removes all instances of the strings in `list_words` that occur in `text`, even in cases when it is not a word such as 'i' in will. I missed that, my bad. By the way, it was an honest question. I didn't think it would make sense to create a new question, but I am not yet allowed to publish comments either on other people's answers or on the original question. So I posted the question as an answer, also because it is a question other people may think of. What is the harm? – Miqueias Brikalski Feb 21 '23 at 05:16
  • Also, there is still another potential solution that uses only built-in functions: `text = ' '.join([word for word in text.split() if word not in list_words])` Though it might be too inefficient. What do you think? – Miqueias Brikalski Feb 21 '23 at 05:19
  • Note that word boundaries don't have to be whitespace only; they can also be punctuation. Your second attempt still doesn't work for cases like `"hello,world"`. Regex is the right tool here. – Selcuk Feb 21 '23 at 05:26
  • It works if the same separator is used between all words, which is the case for this question. – Miqueias Brikalski Feb 21 '23 at 05:44
  • Not really. It won't work if `rope` is in the stop word list. – Selcuk Feb 21 '23 at 11:10
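
The two objections in this comment thread can be checked with a short sketch. The input string and the stop-word set below are hypothetical, chosen so that one stop word is followed by a period and another follows a comma.

```python
import re

stop_words = {'rope', 'world'}
text = 'jump the rope. hello,world'

# split() only breaks on whitespace, so the tokens are 'rope.' and 'hello,world';
# neither equals a stop word and nothing is removed.
split_result = ' '.join(w for w in text.split() if w not in stop_words)
print(repr(split_result))   # 'jump the rope. hello,world'

# \b also matches next to punctuation, so both stop words are found.
pattern = r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\b'
regex_result = re.sub(pattern, '', text)
print(repr(regex_result))   # 'jump the . hello,'
```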