-1

I have a list of words. I would like to count them and find the most common ones.

['project',
 'gutenberg',
 'ebook',
 'oliver',
 'twist',
 'may',......]

I tried to remove the stopwords from my list:

from nltk.corpus import stopwords

data2 = data.split()
for x in data2:
    if x == "":
        data2.remove("")
    elif x in stopwords.words('english'):
        data2.remove(x)

When I look at the results, the counts look fine, but I would like to sort the words by frequency.

from collections import Counter
Counter(data2)
Counter({'project': 88,
         'gutenberg': 98,
         'ebook': 13,
         'oliver': 881,
         'twist': 68,

Why do I still get stopwords? How can I solve that?

Counter(data2).most_common(10)
[('the', 4746),
 ('a', 1943),
 ('said', 1232),
pfabri
  • This isn't really a question about Python, but rather the `nltk` package, which I've now added for you. – pfabri Feb 09 '21 at 17:45
  • You are iterating over a list while modifying its content. That is very wrong. Use `enumerate()` instead. `for i, x in enumerate(data2):`. https://docs.python.org/3/library/functions.html#enumerate – alec_djinn Feb 09 '21 at 17:49
  • Assuming `data2` is a list of words formed from `data` being a space-separated string: it's not best practice to modify a variable whilst you are iterating over it, so here you could do `clean_data = [x for x in data2 if (x != '' and x not in stopwords.words('english'))]` and then the function works – NickHilton Feb 09 '21 at 17:49

3 Answers

2

It is best practice never to mutate a list while you are iterating over it in a for/while loop. For example, say you want to remove the elements equal to 3 or 4 from the list [1, 2, 3, 4, 5, 3, 9].
What you are currently doing is:

L = [ 1, 2, 3, 4, 5, 3, 9]
to_remove = [3, 4]
for x in L:
    if x in to_remove:
        L.remove(x)
print(L)  # prints [1, 2, 4, 5, 9]: the 4 is never visited, so it stays

What you really want is:

L = [x for x in L if x not in to_remove]  # L is now [1, 2, 5, 9]
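
To see why the in-place loop leaves the 4 behind, it may help to trace which elements the loop actually visits. The sketch below is only an illustration: `visited` is a hypothetical helper list added for tracing, and `M` is a second copy of the sample data used to show that iterating over a copy (`M[:]`) is another way around the problem.

```python
L = [1, 2, 3, 4, 5, 3, 9]
to_remove = [3, 4]
visited = []  # hypothetical helper: records each element the loop sees

for x in L:
    visited.append(x)
    if x in to_remove:
        # Removing shifts every later element one slot to the left,
        # so the loop's internal index jumps over the next element.
        L.remove(x)

print(visited)  # the elements the loop looked at (4 and 9 are missing)
print(L)        # what is left afterwards

# Iterating over a copy avoids the skipping, because the copy never changes.
M = [1, 2, 3, 4, 5, 3, 9]
for x in M[:]:
    if x in to_remove:
        M.remove(x)
print(M)
```

The list comprehension is still usually the simpler choice, since it builds the result in one pass without calling `remove` at all.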

Applying this logic to your code would give:

data2 = [x for x in data2 if x != "" and x not in stopwords.words('english')]
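
One thing to consider: the comprehension above calls `stopwords.words('english')` once per word, and a membership test against a list is a linear search. A minimal sketch of building the lookup once as a set follows. The tiny `stop_set` and the sample `data2` are made-up stand-ins so that the snippet runs without the NLTK corpus; with NLTK available you would presumably use `set(stopwords.words('english'))` instead.

```python
# Stand-in for set(stopwords.words('english')); assumed here so the
# example runs without downloading the NLTK stopwords corpus.
stop_set = {"the", "a", "an", "is", "or", "said"}

# Made-up sample data, including an empty string.
data2 = ["the", "oliver", "", "said", "a", "twist", "the", "oliver"]

# Build a new list instead of mutating data2 in place.
# `w and ...` drops empty strings; the set makes each lookup cheap.
clean = [w for w in data2 if w and w not in stop_set]

print(clean)
```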
RobBlanchard
0

I ran the first part of your code, and it does remove stopwords such as 'is' and 'or'.

from nltk.corpus import stopwords

data = "This is an answer or an answer"
data2 = data.split()
for x in data2:
    if x == "":
        data2.remove("")
    elif x in stopwords.words('english'):
        data2.remove(x)
print(data2)

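The output I get is ['This', 'an', 'answer', 'an', 'answer']. Note that 'an' survives even though it is a stopword: removing items from a list while iterating over it makes the loop skip the element that follows each removal. 'This' survives because the NLTK stopword list is lowercase. I would suggest the following snippet, which builds a new list instead of modifying the original: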

from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')

def remove_stopwords(sentence):
    array_of_words = sentence.split()  # split the argument, not the global `data`
    array_cleaned_words = list()
    for word in array_of_words:
        if len(word) == 0 or word in eng_stopwords:
            continue
        array_cleaned_words.append(word) # adding word only if not stopword
    return array_cleaned_words

Now you should have the list of words without stopwords, and you can apply a simple word count with collections.Counter.
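
As a rough sketch of that last step, the cleaned list can be passed straight to `collections.Counter`. The `eng_stopwords` set below is a small made-up stand-in for the NLTK list, and this variant of `remove_stopwords` lowercases the text before comparing, which is an added assumption: the NLTK English list is lowercase, which is why 'This' is still present in the output shown earlier.

```python
from collections import Counter

# Small made-up stand-in for stopwords.words('english').
eng_stopwords = {"this", "is", "an", "or"}

def remove_stopwords(sentence):
    # Lowercase first so that 'This' matches the lowercase stopword 'this'.
    return [w for w in sentence.lower().split() if w not in eng_stopwords]

words = remove_stopwords("This is an answer or an answer to this question")
counts = Counter(words)

# most_common(n) returns (word, count) pairs sorted by count, highest first.
print(counts.most_common(3))
```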

purple_lolakos
0

Your code is a bit inefficient; you don't really need imports.

List=[...]

# Getting rid of empty strings and stop words:

to_remove=["","said","the","a"]

for i in to_remove:
    while i in List:
        List.remove(i)

# etc

# Getting top 10 most occurring words:

most_occurring_words=[]

for i in range(10):
    if not List: # fewer than 10 distinct words
        break
    word=max(List,key=List.count) # gets the top most occurring word
    most_occurring_words.append(word)
    # remove the word from the list so we can get the second top etc
    while word in List:
        List.remove(word)

print(most_occurring_words)

Let me know if you get any errors.
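
If avoiding imports really is the goal, a single pass with a plain dict is one alternative to calling `List.count` repeatedly, which rescans the whole list for every element. This is only a sketch: `words` is made-up sample data and `top_n` is a hypothetical helper name, not something from the question.

```python
def top_n(words, n):
    # Count every word in one pass using a plain dict.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # Sort (word, count) pairs by count, highest first, and keep n of them.
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

words = ["oliver", "twist", "oliver", "fagin", "oliver", "twist"]
print(top_n(words, 2))
```

Unlike the loop above, this leaves the input list untouched and also keeps the counts alongside the words.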

yungmaz13
  • "inefficient, you don't really need imports" -> The OP wants to incorporate `nltk` [stopwords](https://gist.github.com/sebleier/554280). Hard-coding such a list is cumbersome and isn't friendly when dealing with multiple languages – Wondercricket Feb 09 '21 at 18:04
  • @Wondercricket Why not? – yungmaz13 Feb 09 '21 at 18:20