Why are not all stop words removed from a list?

Question

I want to remove the stop words in a given list from a list of words that I created by splitting a text on spaces, so that I can count the most frequent words. However, not all stop words are removed, and I do not understand why.

I defined a function (split_into_words) to split text x into words using re.split(" ", x):

wordsList = split_into_words(x)
wordsList = [item.replace("\n", " ") for item in wordsList]

stopwords = open('stopword.txt').read()

new_list = []
for w in wordsList:
    if not w.lower() in stopwords:
        new_list.append(w)
print(new_list)

The list still includes many stop words, and they appear among the 15 most frequent words (among them of, by, the and others).

`in stopwords` is checking substrings of one string, not individual words. Perhaps you want this to be a list (or a set)? — OneCricketeer, Dec 19 '21 at 14:17
Well, `.read()` returns a single string, not a list of words. Without seeing what your file looks like, you can either split the lines or commas, or whatever delimiter the words are separated by — OneCricketeer, Dec 19 '21 at 14:33
I tried `stopwords = open('stopwordlist.txt').read()`, then `stopwords = []`, then `stopwords = [stopwords.replace("\n", " ") for i in stopwords]`, then `stopwords = [stopwords.split(" ") for i in stopwords]`. I still have the stop words; they even appear more frequently. — despina, Dec 19 '21 at 14:55
Please edit your question to include a [mcve] of the file contents — OneCricketeer, Dec 19 '21 at 14:55
With `stopwords = open('stopwordlist.txt').read(); stopwords = []`, you've just overwritten the string with an empty list. The following two lines then do nothing. Maybe start here: https://stackoverflow.com/questions/3277503/how-to-read-a-file-line-by-line-into-a-list — OneCricketeer, Dec 19 '21 at 15:29
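
To make the point in the comments above concrete, here is a minimal sketch of how `in` behaves on a string versus a list. The contents of `stopword_text` are an assumption (one stop word per line); the real file was never shown in the question.

```python
# Hypothetical file contents: what .read() would return as ONE string
stopword_text = "the\nand\nof"

# On a string, `in` is a substring test, so fragments of words match
print("he" in stopword_text)          # True: "he" occurs inside "the"

# On a list of words, `in` compares whole items
print("he" in stopword_text.split())  # False

# A token with a trailing space (e.g. after replacing "\n" with " ")
# is not a substring of this text, so it would not be filtered out
print("the " in stopword_text)        # False
```

Under this assumption, the substring test both removes too much (word fragments) and misses tokens that still carry extra whitespace, which may be why stop words survived the filter.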

score 0 · Answer 1 · answered Dec 19 '21 at 15:32

If I understood your question correctly, here is what I’d do.

Suppose moby.txt is a text file containing text from Moby Dick, and stopwords.txt is another file with the following content:

and or
in

We’ll remove those words from the text of moby.txt.

# read moby.txt to a list of words
with open("moby.txt", "r") as f:
    moby = f.read().split()

# read the stop word list from stopwords.txt
with open("stopwords.txt", "r") as f:
    stopwords = f.read().split()

# now remove stop words from moby
clean_moby = []
for word in moby:
    if word.lower() not in stopwords:
        clean_moby.append(word)

# you can also do the above in a list comprehension
#clean_moby = [word for word in moby if word.lower() not in stopwords]

print(clean_moby)
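
Since the goal in the question was to count the most frequent words, the same idea can be sketched as one function. This is only a sketch under assumptions: the name `top_words`, the sample strings, and the punctuation stripping are mine, not part of the question. It uses a set for the stop words (faster membership tests than a list) and `collections.Counter` for the counting.

```python
from collections import Counter
import string

def top_words(text, stopword_text, n=15):
    """Return the n most frequent words in text that are not stop words."""
    # split() with no argument splits on any run of whitespace,
    # so newlines and repeated spaces never end up inside a word
    stopwords = set(stopword_text.lower().split())
    words = []
    for token in text.split():
        # strip surrounding punctuation so "whale," is counted as "whale"
        word = token.strip(string.punctuation).lower()
        if word and word not in stopwords:
            words.append(word)
    return Counter(words).most_common(n)

# Hypothetical sample data standing in for the two files
sample_text = "The whale, the sea\nand the ship of the whale"
sample_stopwords = "the and\nof"
print(top_words(sample_text, sample_stopwords))
# [('whale', 2), ('sea', 1), ('ship', 1)]
```

With real files, the two strings would come from `f.read()` as in the code above.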

Thank you all, I think I managed! – despina Dec 19 '21 at 16:44
