
I have a dataset from which I would like to remove stop words.

I used NLTK to get a list of stop words:

from nltk.corpus import stopwords

stopwords.words('english')

Exactly how do I compare the data to the list of stop words, and thus remove the stop words from the data?

Karl Knechtel
Alex
    Where did you get the stopwords from? Is this from NLTK? – tumultous_rooster Apr 07 '14 at 22:15
  • @MattO'Brien `from nltk.corpus import stopwords` for future googlers – danodonovan May 13 '15 at 21:11
  • It is also necessary to run `nltk.download("stopwords")` in order to make the stopword dictionary available. – sffc Jul 10 '15 at 17:12
  • See also http://stackoverflow.com/questions/19130512/stopword-removal-with-nltk – alvas Aug 25 '16 at 13:05
  • Pay attention that a word like "not" is also considered a stopword in NLTK. If you do something like sentiment analysis or spam filtering, a negation may change the entire meaning of the sentence, and if you remove it in the preprocessing phase you might not get accurate results. – anegru Jun 04 '19 at 12:08
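A minimal sketch of the comment above: drop the negations from the stopword set before filtering, so sentiment-bearing words like "not" survive. The small `stop_set` here is a hand-picked stand-in for `set(stopwords.words('english'))`, used only so the example is self-contained.

```python
# stand-in for set(stopwords.words('english'))
stop_set = {"a", "the", "is", "not", "no"}
negations = {"not", "no", "nor", "never"}
stop_set -= negations  # keep negations for sentiment-style tasks

words = ["this", "movie", "is", "not", "good"]
filtered = [w for w in words if w not in stop_set]
print(filtered)
```

Here "not" survives the filter, so the negated meaning of the sentence is preserved.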

13 Answers

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
Stefan Falk
Daren Thomas
  • Thanks to both answers, they both work although it would seem i have a flaw in my code preventing the stop list from working correctly. Should this be a new question post? not sure how things work around here just yet! – Alex Mar 30 '11 at 14:29
  • To improve performance, consider ```stops = set(stopwords.words("english"))``` instead. – isakkarlsson Sep 07 '13 at 22:04
  • `>>> import nltk` then `>>> nltk.download()` [Source](http://www.nltk.org/data.html) – Dec 14 '17 at 20:33
  • `stopwords.words('english')` are lower case. So make sure to use only lower-case words in the list, e.g. `[w.lower() for w in word_list]` – Alex Aug 24 '18 at 18:10
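Putting the two comments above together, a hedged sketch: build the stopword set once (so each membership test is O(1) instead of a scan of the list) and lowercase each word before comparing. The small `stop_set` stands in for `set(stopwords.words('english'))` so the example runs without NLTK downloads.

```python
stop_set = {"the", "a", "in", "is"}  # stand-in for set(stopwords.words('english'))

word_list = ["The", "cat", "is", "in", "the", "hat"]
# lowercase before testing, since the NLTK stopwords are all lower case
filtered_words = [w for w in word_list if w.lower() not in stop_set]
print(filtered_words)
```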

To exclude all types of stop words, including the NLTK stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))        # about 900 stopwords
nltk_words = list(stopwords.words('english'))  # about 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
sumitjainjr

You could also do a set diff, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
David Lemphers
  • Note: this converts the sentence to a SET which removes all the duplicate words and therefore you will not be able to use frequency counting on the result – David Dehghan Feb 21 '17 at 23:59
  • converting to a set might remove viable information from the sentence by scraping multiple occurrences of an important word. – Ujjwal Nov 28 '19 at 03:57
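If you need word order and duplicates preserved (e.g. for frequency counting), a hedged alternative to the set difference above is to use a set only for the membership test. `stop_set` is a small stand-in for `set(nltk.corpus.stopwords.words('english'))`:

```python
stop_set = {"the", "in", "and"}  # stand-in for the NLTK stopword set

tokens = ["blue", "car", "and", "blue", "window"]
filtered = [t for t in tokens if t not in stop_set]
print(filtered)  # both occurrences of "blue" survive
```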

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over the original list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove stopwords from the copy
das_weezul

There's a very simple, lightweight Python package, stop-words, made just for this purpose.

First install the package using: pip install stop-words

Then you can remove your words in one line using list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages, such as:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian
user_3pij

Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # delete stopwords from text
justadev

Use the textcleaner library to remove stopwords from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to do so with this library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the above code to remove the stop words.


Although the question is a bit old, here is a newer library worth mentioning that can do extra tasks.

In some cases, you don't only want to remove stop words. Rather, you may want to find the stopwords in the text data and store them in a list, so that you can find the noise in the data and make it more interactive.

The library is called 'textfeatures'. You can use it as follows:

! pip install textfeatures
import textfeatures as tf
import pandas as pd

For example, suppose you have the following set of strings:

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]

df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df

Now, call the stopwords() function and pass the parameters you want:

tf.stopwords(df,"text","stopwords") # extract stop words
df[["text","stopwords"]].head() # show both columns

The result is going to be:

    text                                 stopwords
0   blue car and blue window             [and]
1   black crow in the window             [in, the]
2   i see my reflection in the window    [i, my, in, the]

As you can see, the last column has the stop words included in that document (record).
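A hedged, pure-Python sketch of the same idea, collecting (rather than removing) the stopwords each document contains; `stop_set` is a small stand-in for a real stopword list such as `stopwords.words('english')`:

```python
stop_set = {"and", "in", "the", "i", "my"}  # stand-in stopword list

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window",
]
# per document, keep only the tokens that ARE stopwords
found = [[w for w in t.split() if w in stop_set] for t in texts]
print(found)
```

The result matches the `stopwords` column in the table above.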

Taie
  • probably should not use alias tf, as this makes it look like a new TensorFlow feature for many of us :-) – swygerts Feb 20 '23 at 21:38

You can use this function; note that you need to lowercase all the words:

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    stops = set(stopwords.words("english"))  # build the set once for fast lookups
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lower-cased
        if word not in stops:
            processed_word_list.append(word)
    return processed_word_list
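A self-contained variant of the same function, hedged so it runs without NLTK downloads: the stopword set is built once and passed in as a parameter, and the small default set stands in for `set(stopwords.words("english"))`.

```python
def remove_stopwords(word_list, stop_set=frozenset({"the", "is", "a"})):
    # stop_set is a stand-in for set(stopwords.words("english"))
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lower-cased
        if word not in stop_set:
            processed_word_list.append(word)
    return processed_word_list

print(remove_stopwords(["The", "sky", "is", "blue"]))
```

Passing the set in also makes the function easy to reuse with a different stopword list.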

Using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
Saeid BK
  • if `word_list` is large this code is very slow. It is better to convert the stopwords list to a set before using it: `.. in set(stopwords.words('english'))`. – Robert Sep 23 '19 at 08:43
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# as a list comprehension:
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# or, equivalently, as an explicit loop:
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
H M

Here is an example. First, I extract the text data from the data frame (twitter_df) for further processing:

     tweetText = twitter_df['text']

Then, to tokenize it, I use the following method:

     from nltk.tokenize import word_tokenize
     tweetText = tweetText.apply(word_tokenize)

Then, to remove stop words,

     import nltk
     from nltk.corpus import stopwords
     nltk.download('stopwords')

     stop_words = set(stopwords.words('english'))
     tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
     tweetText.head()

I think this will help you.

user_3pij

In case your data is stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopword list by default.

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
Jonathan Besomi