NLTK stopword removal gives the wrong output

Question

I have an issue with removing stopwords. When I execute my script:

import nltk
from nltk.corpus import stopwords
file1=open('english.txt', 'r')
english=file1.read()
file1.close()
english_corpus_lowercase =([w.lower() for w in english])
english_without_punc=''.join([c for c in english_corpus_lowercase if c not in (",", "``", "`", "?", ".", ";", ":", "!", "''", "'", '"', "-", "(", ")")])
print(english_without_punc)
print(type(english_without_punc))
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)
english_corpus_sans_stopwords = set()
for w in english_without_punc:
    if w not in stopwords:
        english_corpus_sans_stopwords.add(w)
        print(english_corpus_sans_stopwords)

It gives me the following output. How can I fix it?

{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '“', 'g', 'u', 'p', 'c'}
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '“', 'g', 'u', 'p', 'c'}
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '“', 'g', 'u', 'p', 'c'}
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '“', 'g', 'u', 'p', 'c'}
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '“', 'g', 'u', 'p', 'c'}
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '“', 'g', 'u', 'p', 'c'}
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '“', 'g', 'u', 'p', 'c'}
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '“', 'g', 'u', 'p', 'c'}
{'b', 'n', 'f', 'l', 'v', 'h', 'k', 'e', 'r', ' ', 'w', '“', 'g', 'u', 'p', 'c'}

Your `english_corpus_lowercase` is not a list of words, but a character string. You must tokenize it first. — DYZ, Aug 11 '17 at 22:30
As a side note, since "``" and the like are not single-character strings, they will never be eliminated from your text. — DYZ, Aug 11 '17 at 22:38
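
Both comments can be checked with a small sketch. It assumes a made-up sample sentence in place of `english.txt`, and uses `str.split()` only as a crude stand-in for a real tokenizer:

```python
# Hypothetical sample text standing in for the contents of english.txt.
text = "The cat sat on the mat."

# Iterating over a str yields one-character strings, not words.
chars = [c for c in text.lower()]

# Splitting on whitespace gives word-like pieces (punctuation stays attached).
words = text.lower().split()

# A two-character string such as "``" can never equal a single character,
# so a character-by-character filter never matches it.
multi_char_matches = [c for c in "say ``hi''" if c in ("``", "''")]

print(chars[:3])           # first three characters
print(words)               # whitespace-separated pieces
print(multi_char_matches)  # empty list
```

This is why the loop in the question adds single letters to the set: every `w` it sees is one character of the text, not a word.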

M3RS · Accepted Answer · 2017-08-11T23:07:34.607

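You need to split the text into word tokens before filtering; iterating over the raw string yields single characters. Try the following: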

from nltk.corpus import stopwords
from nltk import word_tokenize

# Note: word_tokenize and the stopword list rely on NLTK data packages,
# which may need a one-time download via nltk.download().
with open('english.txt', 'r') as file1:
    english = file1.read()

# Tokenize into words first, then lowercase each token
english_corpus_lowercase = [w.lower() for w in word_tokenize(english)]
# Drop tokens that are punctuation marks
english_without_punc = [c for c in english_corpus_lowercase if c not in (",", "``", "`", "?", ".", ";", ":", "!", "''", "'", '"', "-", "(", ")")]
english_corpus_sans_stopwords = []
# Use a separate name so the imported `stopwords` module is not shadowed
stop_words = set(stopwords.words('english'))

for w in english_without_punc:
    if w not in stop_words:
        english_corpus_sans_stopwords.append(w)
print(english_corpus_sans_stopwords)

edited Aug 11 '17 at 23:07

answered Aug 11 '17 at 23:01

Thank you very much! It works flawlessly)) – Miss Alena Aug 12 '17 at 07:18
You are welcome, the trick was just to use `word_tokenize`, which takes care of the heavy lifting for you :) – M3RS Aug 12 '17 at 08:49
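
To illustrate what a tokenizer contributes here, the following is a minimal sketch that runs without NLTK. The `simple_tokenize` helper and the tiny `STOPWORDS` set are hypothetical stand-ins invented for this example; they are not NLTK's algorithm or its actual stopword list, and `word_tokenize` handles many cases this regex does not.

```python
import re
import string

# Hypothetical miniature stopword list, for illustration only; the real
# list comes from nltk.corpus.stopwords.words('english').
STOPWORDS = {"the", "on", "a", "is", "it"}

PUNCTUATION = set(string.punctuation)


def simple_tokenize(text):
    """Rough stand-in for a tokenizer: words and punctuation marks
    become separate, lowercased tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def remove_stopwords(text):
    """Tokenize, then drop punctuation tokens and stopwords."""
    tokens = simple_tokenize(text)
    return [t for t in tokens if t not in STOPWORDS and t not in PUNCTUATION]


result = remove_stopwords("The cat sat on the mat, happily.")
print(result)  # ['cat', 'sat', 'mat', 'happily']
```

Because the filter now compares whole tokens against the stopword set, words such as "the" and "on" are removed as intended, rather than individual letters being compared against whole words.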
