How to remove list of words from a list of strings

Question

Sorry if the question is a bit confusing. This is similar to this question

I think the above question is close to what I want, but it is in Clojure.

There is another question

I need something like this but instead of '[br]' in that question, there is a list of strings that need to be searched and removed.

Hope I made myself clear.

I think that this is due to the fact that strings in Python are immutable.

I have a list of noise words that need to be removed from a list of strings.

If I use the list comprehension, I end up searching the same string again and again. So, only "of" gets removed and not "the". So my modified list looks like this

places = ['New York', 'the New York City', 'at Moscow']  # and many more

noise_words_list = ['of', 'the', 'in', 'for', 'at']

for place in places:
    stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

I would like to know what mistake I'm making.

You're not making yourself clear; state your question *here*, and then put links to similar questions with similar answers if you think that's necessary below. — Humphrey Bogart, Aug 18 '10 at 10:36

Tony Veijalainen · Answer 1 · 2010-08-19T08:34:28.607

15

Without regular expressions, you could do it like this:

places = ['of New York', 'of the New York']

noise_words_set = {'of', 'the', 'at', 'for', 'in'}
stuff = [' '.join(w for w in place.split() if w.lower() not in noise_words_set)
         for place in places
         ]
print(stuff)  # ['New York', 'New York']
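
For readers unfamiliar with nested comprehensions, here is a rough sketch of the same filtering written as explicit loops. The name `kept_words` is only illustrative and is not part of the original answer.

```python
places = ['of New York', 'of the New York']
noise_words_set = {'of', 'the', 'at', 'for', 'in'}

stuff = []
for place in places:
    kept_words = []  # illustrative name: words of this place that survive the filter
    for w in place.split():
        # Compare in lower case so that 'The' and 'the' are treated alike.
        if w.lower() not in noise_words_set:
            kept_words.append(w)
    # Rebuild the place from the surviving words.
    stuff.append(' '.join(kept_words))

print(stuff)  # ['New York', 'New York']
```

Because each place is split into whole words first, a noise word that merely appears inside a longer word (such as "in" in "Spain") is left alone.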

edited Aug 19 '10 at 08:34

answered Aug 18 '10 at 11:25

Tony Veijalainen


I came across this and had no idea what's going on here. If anyone stumbles across this and wonders what magic is happening, it's called a list comprehension, and this is a good article explaining it: http://carlgroner.me/Python/2011/11/09/An-Introduction-to-List-Comprehensions-in-Python.html – Eugene Niemand Jul 26 '17 at 10:53

score 11 · Accepted Answer · edited May 23 '17 at 12:18


Here is my stab at it. This uses regular expressions.

import re
pattern = re.compile(r"(of|the|in|for|at)\W", re.I)
phrases = ['of New York', 'of the New York']
list(map(lambda phrase: pattern.sub("", phrase), phrases))  # ['New York', 'New York']

Sans lambda:

[pattern.sub("", phrase) for phrase in phrases]

Update

Fix for the bug pointed out by gnibbler (thanks!):

pattern = re.compile("\\b(of|the|in|for|at)\\W", re.I)
phrases = ['of New York', 'of the New York', 'Spain has rain']
[pattern.sub("", phrase) for phrase in phrases] # ['New York', 'New York', 'Spain has rain']

@prabhu: the above change avoids snipping off the trailing "in" from "Spain". To verify, run both versions of the regular expression against the phrase "Spain has rain".
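
A small sketch of that check, for anyone who wants to see the difference side by side. The names `naive` and `bounded` are mine, chosen only for this comparison.

```python
import re

# First version: no word boundary, so "in" inside "Spain" can match.
naive = re.compile(r"(of|the|in|for|at)\W", re.I)
# Fixed version: \b requires the noise word to start at a word boundary.
bounded = re.compile(r"\b(of|the|in|for|at)\W", re.I)

phrase = "Spain has rain"
print(naive.sub("", phrase))    # Spahas rain
print(bounded.sub("", phrase))  # Spain has rain
```

The "in" at the end of "rain" survives in both versions only because nothing follows it for `\W` to match.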


answered Aug 18 '10 at 09:58

Manoj Govindan


Thanks. It works this way. I was able to understand the concept of lambda more clearly now as I got a chance to implement this. – prabhu Aug 18 '10 at 10:17

This doesn't work properly for the phrase "Spain has rain". It's easy to fix though – John La Rooy Aug 18 '10 at 10:29
@Gnibbler: thanks for pointing it out. Am changing my answer accordingly. – Manoj Govindan Aug 18 '10 at 10:47
I added the word "max" to the pattern, and in some cases it removed the word, in other cases it didn't. It is weird; someone should test it to see if they get the same results. – almost a beginner Jan 08 '17 at 07:14

John La Rooy · Answer 3 · 2010-08-18T10:24:34.523

4

>>> import re
>>> noise_words_list = ['of', 'the', 'in', 'for', 'at']
>>> phrases = ['of New York', 'of the New York']
>>> noise_re = re.compile('\\b(%s)\\W'%('|'.join(map(re.escape,noise_words_list))),re.I)
>>> [noise_re.sub('',p) for p in phrases]
['New York', 'New York']

edited Aug 18 '10 at 10:24

answered Aug 18 '10 at 10:04

John La Rooy


Wow! That is a really cool way of doing it, though I strained my brain. :-) – prabhu Aug 18 '10 at 10:21
This does not seem to get every instance of words. For example, "of New York of" becomes "New York of". – Namey May 05 '14 at 00:38

@Namey, you could use something like `'\\W?\\b(%s)\\W?'`. Without the OP providing a comprehensive set of testcases, it's a bit of a whack-a-mole – John La Rooy May 05 '14 at 01:12
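
One possible way to handle a trailing noise word such as the one in "of New York of" is sketched below. This is a different technique from the pattern suggested in the comment above: it matches whole words with `\b` on both sides and then collapses the leftover whitespace. The helper name `strip_noise` is hypothetical, and the sketch has only been checked against the few phrases mentioned in this thread.

```python
import re

noise_words_list = ['of', 'the', 'in', 'for', 'at']

# \b on both sides so that only whole words match, wherever they occur.
noise_re = re.compile(
    r'\b(?:%s)\b' % '|'.join(map(re.escape, noise_words_list)), re.I)

def strip_noise(phrase):
    """Hypothetical helper: drop noise words, then tidy the spacing."""
    without_noise = noise_re.sub('', phrase)
    # split() with no arguments discards leading, trailing and repeated whitespace.
    return ' '.join(without_noise.split())

print(strip_noise('of New York of'))  # New York
print(strip_noise('Spain has rain'))  # Spain has rain
```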

score 1 · Answer 4 · answered Aug 18 '10 at 10:13

Since you would like to know what you are doing wrong, this line:

stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

takes place, and then begins to loop over words. First it checks for "of". Your place (e.g. "of the New York") is checked to see if it starts with "of". It is transformed (call to replace and strip) and added to the result list. The crucial thing here is that the result is never examined again. For every word you iterate over in the comprehension, a new result is added to the result list. So the next word is "the" and your place ("of the New York") doesn't start with "the", so no new result is added.
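
To make this concrete, here is a small sketch that runs the same comprehension on a single place. The wrapper function `per_word_results` is a name made up for this illustration.

```python
noise_words_list = ['of', 'the', 'in', 'for', 'at']

def per_word_results(place):
    # Same comprehension as in the question, applied to one place.
    # Each noise word is tested against the ORIGINAL place, never against
    # the result of an earlier replacement.
    return [place.replace(w, "").strip()
            for w in noise_words_list if place.startswith(w)]

print(per_word_results('of the New York'))  # ['the New York']
print(per_word_results('New York'))         # []
```

Only "of" passes the startswith test for the first place, so the list contains a single entry that still carries "the". A place that starts with no noise word yields an empty list.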

I assume the result you got eventually is the concatenation of your place variables. A simpler to read and understand procedural version would be (untested):

results = []
for place in places:
    for word in noise_words_list:
        # place is reassigned, so later words see the already-stripped string
        if place.startswith(word):
            place = place.replace(word, "").strip()
    results.append(place)

Keep in mind that replace() will remove the word anywhere in the string, even if it occurs as a simple substring. You can avoid this by using regexes with a pattern something like ^the\b.
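
A minimal sketch of that regex idea, assuming the goal is to strip only leading noise words. The pattern generalises `^the\b` to the whole noise list and allows several noise words in a row. The names `leading_noise_re` and `strip_leading_noise` are hypothetical.

```python
import re

noise_words_list = ['of', 'the', 'in', 'for', 'at']
alternation = '|'.join(map(re.escape, noise_words_list))

# ^ anchors at the start of the string, \b makes sure the noise word ends there,
# and the outer group with + consumes several leading noise words in a row.
leading_noise_re = re.compile(r'^(?:(?:%s)\b\s*)+' % alternation, re.I)

def strip_leading_noise(place):
    """Hypothetical helper: remove noise words only at the start of place."""
    return leading_noise_re.sub('', place)

print(strip_leading_noise('of the New York'))  # New York
print(strip_leading_noise('Athens'))           # Athens
```

Unlike `replace()`, this leaves "Athens" untouched even though it begins with the letters "at", because `\b` does not match between "t" and "h".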
