Remove all occurrences of words in a string from a Python list

Question

I'm trying to match and remove all words in a list from a string using a compiled regex but I'm struggling to avoid occurrences within words.

Current:

 import re

 REMOVE_LIST = ["a", "an", "as", "at", ...]

 remove = '|'.join(REMOVE_LIST)
 regex = re.compile(r'('+remove+')', flags=re.IGNORECASE)
 out = regex.sub("", text)

In: "The quick brown fox jumped over an ant"

Out: "quick brown fox jumped over t"

Expected: "quick brown fox jumped over"

I've tried changing the pattern passed to re.compile to the following, but to no avail:

 regex = re.compile(r'\b('+remove+')\b', flags=re.IGNORECASE)

Any suggestions or am I missing something garishly obvious?

Presumably `ant` is part of your remove list? – Martijn Pieters Mar 15 '13 at 15:07

score 19 · Answer 1 · answered Mar 15 '13 at 15:19

Here is a suggestion that avoids regex, which you may want to consider:

>>> sentence = 'word1 word2 word3 word1 word2 word4'
>>> remove_list = ['word1', 'word2']
>>> word_list = sentence.split()
>>> ' '.join([i for i in word_list if i not in remove_list])
'word3 word4'

jurgenreza


Groovy. Hadn't thought of that. Thanks :) – Ogre Mar 15 '13 at 15:21
It's worth pointing out that this will have difficulty with punctuation, and will not preserve tabs/consecutive whitespaces (not sure if the latter is important). – NPE Mar 15 '13 at 15:23
It's worth noting that if `remove_list` is large, you would be better off with `remove_set = {'word1', 'word2', ...}` as sets have much faster membership tests. – Gareth Latty Mar 15 '13 at 15:24
@NPE You are right. We don't know the exact usage of the OP so I thought they might want to consider it. – jurgenreza Mar 15 '13 at 15:31
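
As a rough sketch of the set-based variant suggested in the comments, the snippet below wraps the same split-and-filter idea in a helper. The name `remove_words` is hypothetical and not part of the original answer. The second call illustrates the punctuation caveat raised above: a token such as `word1,` is not equal to `word1`, so it survives the filter.

```python
sentence = 'word1 word2 word3 word1 word2 word4'
remove_set = {'word1', 'word2'}  # set gives fast membership tests


def remove_words(text, words):
    """Hypothetical helper: drop whitespace-separated tokens found in `words`."""
    return ' '.join(w for w in text.split() if w not in words)


result = remove_words(sentence, remove_set)
print(result)  # word3 word4

# Punctuation stays attached to the token, so 'word1,' is not removed.
punctuated = remove_words('word1, word3', remove_set)
print(punctuated)  # word1, word3
```

Note that, like the list version, this matches case-sensitively and collapses runs of whitespace into single spaces.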

score 14 · Accepted Answer · answered Mar 15 '13 at 15:11

One problem is that only the first \b is inside a raw string. The second gets interpreted as the backspace character (ASCII 8) rather than as a word boundary.

To fix, change

regex = re.compile(r'\b('+remove+')\b', flags=re.IGNORECASE)

to

regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
                                 ^ THIS
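
For reference, a minimal end-to-end sketch of the corrected pattern follows. It makes a few assumptions beyond the answer itself: the question's list is elided, so `"the"` and `"ant"` are added here only to reproduce the expected output; `re.escape` is applied in case an entry contains regex metacharacters; and a final whitespace-collapsing step is included because substituting an empty string leaves the surrounding spaces behind.

```python
import re

# Stand-in for the question's elided list; "the" and "ant" are assumed
# entries, added only so the expected output can be reproduced.
REMOVE_LIST = ["a", "an", "as", "at", "the", "ant"]

# Escape each entry so metacharacters are matched literally.
remove = "|".join(re.escape(word) for word in REMOVE_LIST)

# Both halves are raw strings, so both \b are word boundaries.
regex = re.compile(r"\b(" + remove + r")\b", flags=re.IGNORECASE)

text = "The quick brown fox jumped over an ant"
out = regex.sub("", text)

# Removing words leaves stray spaces behind; collapse and trim them.
out = " ".join(out.split())
print(out)  # quick brown fox jumped over
```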

NPE

As a trick to discover this (aside from knowing this beforehand), output the pattern with `regex.pattern` – nhahtdh Mar 15 '13 at 15:14
A bit cleaner still using f-strings: re.compile(fr"\b({remove})\b") – Pablo Feb 17 '23 at 08:12
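
The two comments above can be combined into a short sketch: printing the `repr` of `regex.pattern` makes the stray backspace visible, and the raw f-string form (Python 3.6+) avoids the problem. The two-word `remove` string is an illustrative stand-in for the question's joined list.

```python
import re

remove = "a|an"  # illustrative stand-in for '|'.join(REMOVE_LIST)

# Original attempt: the trailing ')\b' is a normal string, so \b is a backspace.
broken = re.compile(r'\b(' + remove + ')\b', flags=re.IGNORECASE)

# Raw f-string: both \b sequences reach the regex engine as word boundaries.
fixed = re.compile(fr"\b({remove})\b", flags=re.IGNORECASE)

print(repr(broken.pattern))  # '\\b(a|an)\x08'
print(repr(fixed.pattern))   # '\\b(a|an)\\b'
```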
