0

stopwords is a list of strings, and tokentext is a list of lists of strings. (Each inner list is a sentence; the list of lists is a text document.)
I am simply trying to remove all the strings in tokentext that also occur in stopwords.

for element in tokentext:
    for word in element:
        if(word.lower() in stopwords):
             element.remove(word)

print(tokentext)

I was hoping someone could point out the fundamental flaw in the way I am iterating over the list.

Here is a data set where it fails: http://pastebin.com/p9ezh2nA

OctaveParango

2 Answers

3

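Removing items from a list while you iterate over it makes the loop skip elements: after each removal the remaining items shift left by one, but the iterator's position still advances, so the word right after a removed stopword is never checked. Build a new list instead, something like: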

stopwords = ["some", "strings"]
tokentext = [ ["some", "lists"], ["of", "strings"] ]

new_tokentext = [[word for word in lst if word not in stopwords] for lst in tokentext]
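# builds a new list of lists, keeping only the words that are not in stopwords
# (use word.lower() in the test to match case-insensitively, as in the question)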

Or using filter:

new_tokentext = [list(filter(lambda x: x not in stopwords, lst)) for lst in tokentext]
# the call to `list` is unnecessary in Python 2, where filter already returns a list
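
To see why the original loop misses words, here is a small sketch (not part of the original thread's data; the sample lists and the helper name `remove_stopwords` are made up for illustration). It first reproduces the skip, then filters case-insensitively into a new list, assuming the stopwords should match regardless of case as the question's `word.lower()` suggests:

```python
# Reproduce the problem: two stopwords in a row, remove while iterating.
words = ["the", "the", "cat"]
for word in words:
    if word == "the":
        # Removing shifts the remaining items left, so the loop's next
        # position lands past the second "the" and never checks it.
        words.remove(word)
# One "the" survives: words is now ["the", "cat"]


def remove_stopwords(tokentext, stopwords):
    """Return a new list of lists with stopwords filtered out.

    Hypothetical helper: lowercases both sides so the match is
    case-insensitive, and uses a set for fast membership tests.
    """
    stopset = {w.lower() for w in stopwords}
    return [
        [word for word in sentence if word.lower() not in stopset]
        for sentence in tokentext
    ]


cleaned = remove_stopwords([["The", "the", "cat"], ["of", "strings"]], ["the", "of"])
```

Because nothing is removed from a list that is being iterated, consecutive stopwords are all dropped.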
Adam Smith
  • @user3878398 I need to know how this "doesn't seem to do it." This works in my example, so if my example differs from your setup, then I'll need to know what's going wrong to fix it – Adam Smith Jan 19 '15 at 05:07
  • If I knew I would tell you :) Your dummy example is correct, but somehow when I run it myself on my stopwords and tokentext, it doesn't work. I am puzzled. – OctaveParango Jan 19 '15 at 05:21
  • @user3878398 YOUR solution works, as well. [This is my output](http://codepad.org/EMY37hfI) – Adam Smith Jan 19 '15 at 05:24
  • No, my solution does not work. I am not that stupid :p. Try with this: http://pastebin.com/qSudHq8k – OctaveParango Jan 19 '15 at 05:36
  • @user3878398 `tokentext` is not a list of lists in that paste. It's a single list of strings, followed by a bunch of lists that you aren't assigning to a variable :) – Adam Smith Jan 19 '15 at 05:38
  • Sorry, it's late and I'm tired, haha. Ignore line 36 and add a set of brackets from the start to the end of line 35. That should be your tokentext – OctaveParango Jan 19 '15 at 05:44
-2

You could just do something simple like:

for element in tokentext:
    if element in stopwords:
        stopwords.remove(element)

It's kind of like yours, but without the extra for loop. I am not sure whether this works or whether it's what you are trying to achieve, but it's an idea, and I hope it helps!

Chris Nguyen
  • This logic is backwards (you're removing words from `stopwords` rather than from `element`), and if `tokentext` is a list of lists of strings, then `element` is a list of strings, so `element` will never be in `stopwords` (which is also a list of strings) – Adam Smith Jan 19 '15 at 05:13
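
The point in that comment can be checked directly. The sketch below reuses the sample data from the first answer; the variable `matches` is only for illustration. It shows that a whole sentence (a list) is never a member of a list of strings. It then shows one possible way to filter each sentence in place with slice assignment, in case the original list objects need to be kept:

```python
stopwords = ["some", "strings"]
tokentext = [["some", "lists"], ["of", "strings"]]

# Each element is a list of strings, so it can never equal one of the
# strings in stopwords; every membership test is False.
matches = [element in stopwords for element in tokentext]

# In-place alternative: build the filtered sentence first, then assign
# it back into the same list object with slice assignment.
for sentence in tokentext:
    sentence[:] = [word for word in sentence if word.lower() not in stopwords]
```

Slice assignment replaces the contents only after the filtered list has been fully built, so nothing is removed in the middle of an iteration.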