
I have a list of processed text files that looks somewhat like this:

text = "this is the first text document " this is the second text document " this is the third document "

I've been able to successfully tokenize the sentences:

sentences = sent_tokenize(text)
for ii, sentence in enumerate(sentences):
    sentences[ii] = remove_punctuation(sentence)
sentence_tokens = [word_tokenize(sentence) for sentence in sentences]

And now I would like to remove stopwords from this list of tokens.
However, because it's a list of token lists (one per sentence), I can't seem to figure out how to do this.

This is what I've tried so far, but it returns no results:

sentence_tokens_no_stopwords = [w for w in sentence_tokens if w not in stopwords]

I'm assuming achieving this will require some sort of for loop, but what I have now isn't working. Any help would be appreciated!

Jadeye
parker117

1 Answer


Since sentence_tokens is a list of lists, each w in your comprehension is a whole sentence (a list of tokens), not a word, so nothing gets filtered. You need a nested list comprehension:

sentence_tokens_no_stopwords = [[w for w in s if w not in stopwords] for s in sentence_tokens]
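For illustration, here is a minimal, self-contained sketch of that nested comprehension on hand-written tokens. The stopword set below is a stand-in; in practice you would use NLTK's list, e.g. `set(stopwords.words("english"))` from `nltk.corpus`:

```python
# Stand-in stopword set; replace with NLTK's list in real code, e.g.:
#   from nltk.corpus import stopwords
#   stop_set = set(stopwords.words("english"))
stopwords = {"this", "is", "the"}

# Example of the nested structure produced by word-tokenizing each sentence.
sentence_tokens = [
    ["this", "is", "the", "first", "text", "document"],
    ["this", "is", "the", "second", "text", "document"],
]

# Outer part walks sentences; inner comprehension filters words in each one.
sentence_tokens_no_stopwords = [
    [w for w in sentence if w not in stopwords]
    for sentence in sentence_tokens
]

print(sentence_tokens_no_stopwords)
# [['first', 'text', 'document'], ['second', 'text', 'document']]
```

Using a set for the stopwords makes each `w not in stopwords` check O(1), which matters when filtering many documents.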
Ohad Zadok