I want to remove the stop words in a given list from a list of words that I created by splitting a text on spaces, so that I can count the most frequent words. However, not all stop words are removed, and I do not understand why.

I defined a function (split_into_words) to split text x into words using re.split(" ", x):

wordsList = split_into_words(x)
wordsList = [item.replace("\n", " ") for item in wordsList]

stopwords = open('stopword.txt').read()

new_list = []
for w in wordsList:
    if not w.lower() in stopwords:
        new_list.append(w)
print(new_list)

The list still includes many stop words, and they appear among the 15 most frequent words (for example of, by, the, and others).

jonrsharpe
despina
  • `in stopwords` is checking substrings of one string, not individual words. Perhaps you want this to be a list (or a set)? – OneCricketeer Dec 19 '21 at 14:17
  • How do I correct that? – despina Dec 19 '21 at 14:20
  • Well, `.read()` returns a single string, not a list of words. Without seeing what your file looks like, I can only say that you can split it on newlines, commas, or whatever delimiter separates the words – OneCricketeer Dec 19 '21 at 14:33
  • I tried `stopwords = open('stopwordlist.txt').read()`, then `stopwords = []`, then `stopwords = [stopwords.replace("\n", " ") for i in stopwords]`, then `stopwords = [stopwords.split(" ") for i in stopwords]`. I still have the stop words; they even appear more frequently. – despina Dec 19 '21 at 14:55
  • Please edit your question to include a [mcve] of the file contents – OneCricketeer Dec 19 '21 at 14:55
  • With `stopwords = open('stopwordlist.txt').read(); stopwords=[]`, you've just overwritten the string with an empty list. The following two lines then do nothing. Maybe start here https://stackoverflow.com/questions/3277503/how-to-read-a-file-line-by-line-into-a-list – OneCricketeer Dec 19 '21 at 15:29
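
The difference the comments describe, between testing membership in one string and testing membership in a collection of words, can be illustrated with a minimal sketch. The contents of `stopwords_text` are a hypothetical stand-in for the file, since the actual file is not shown in the question.

```python
# Hypothetical stand-in for what open('stopword.txt').read() might return.
stopwords_text = "the\nof\nby\nother\n"

# `in` on a string is a substring test, so fragments of stop words match too.
print("the" in stopwords_text)  # True: "the" is a stop word
print("he" in stopwords_text)   # True: "he" is merely a substring of "the"

# Splitting on whitespace and building a set gives exact-word membership.
stopword_set = set(stopwords_text.split())
print("the" in stopword_set)    # True
print("he" in stopword_set)     # False
```

With a set, a word is filtered only when it matches a stop word exactly. A token that still contains a newline or a trailing space will not match either way.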

1 Answer

If I understood your question correctly, here is what I’d do.

Suppose moby.txt is a text file containing text from Moby Dick, and stopwords.txt is another file with the following content:

and or
in

We’ll remove those words from the text of moby.txt.

# read moby.txt to a list of words
with open("moby.txt", "r") as f:
    moby = f.read().split()

# read the stop word list from stopwords.txt
with open("stopwords.txt", "r") as f:
    stopwords = f.read().split()

# now remove stop words from moby
clean_moby = []
for word in moby:
    if word.lower() not in stopwords:
        clean_moby.append(word)

# you can also do the above in a list comprehension
#clean_moby = [word for word in moby if word.lower() not in stopwords]

print(clean_moby)
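
Since the goal in the question is to count the most frequent words, here is a rough sketch of how the filtering above might be combined with `collections.Counter`. The function name `top_words`, the sample strings, and the punctuation stripping are assumptions added for illustration; they do not come from the question. Note that `str.split()` with no argument splits on any whitespace, so a token such as `"end\nof"` (which `re.split(" ", x)` would leave in one piece) becomes two words.

```python
import string
from collections import Counter


def top_words(text, stopwords_text, n=15):
    """Return the n most common words in text that are not stop words."""
    # A set makes each membership test an exact-word lookup.
    stopwords = set(stopwords_text.lower().split())
    # split() with no argument splits on spaces, tabs and newlines alike.
    words = text.lower().split()
    # Assumption: strip surrounding punctuation so "sea." counts as "sea".
    words = [w.strip(string.punctuation) for w in words]
    kept = [w for w in words if w and w not in stopwords]
    return Counter(kept).most_common(n)


# Hypothetical sample data standing in for the two files.
sample = "The whale and the sea.\nThe whale of the deep, by the sea"
print(top_words(sample, "the of by and", n=2))
```

To use it with the files from this answer, pass `open("moby.txt").read()` and `open("stopwords.txt").read()` as the two arguments.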
ljmc