
I have a dataset from which I would like to remove stop words.

I used NLTK to get a list of stop words:

from nltk.corpus import stopwords

stopwords.words('english')

Exactly how do I compare the data to the list of stop words, and thus remove the stop words from the data?

Karl Knechtel
Alex
    Where did you get the stopwords from? Is this from NLTK? – tumultous_rooster Apr 07 '14 at 22:15
  • @MattO'Brien `from nltk.corpus import stopwords` for future googlers – danodonovan May 13 '15 at 21:11
  • It is also necessary to run `nltk.download("stopwords")` in order to make the stopword dictionary available. – sffc Jul 10 '15 at 17:12
  • See also http://stackoverflow.com/questions/19130512/stopword-removal-with-nltk – alvas Aug 25 '16 at 13:05
  • Pay attention that a word like "not" is also considered a stopword in NLTK. If you do something like sentiment analysis or spam filtering, a negation may change the entire meaning of the sentence, and if you remove it in the preprocessing phase you might not get accurate results. – anegru Jun 04 '19 at 12:08
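A minimal sketch of the comment above: drop the negations from the stopword set before filtering, so sentiment-bearing words like "not" survive. The small `stop_set` here is a hand-picked stand-in for `set(stopwords.words('english'))`, used only so the example is self-contained.

```python
# stand-in for set(stopwords.words('english'))
stop_set = {"a", "the", "is", "not", "no"}
negations = {"not", "no", "nor", "never"}
stop_set -= negations  # keep negations for sentiment-style tasks

words = ["this", "movie", "is", "not", "good"]
filtered = [w for w in words if w not in stop_set]
print(filtered)
```

Here "not" survives the filter, so the negated meaning of the sentence is preserved.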

13 Answers

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
Stefan Falk
Daren Thomas
  • Thanks to both answers, they both work although it would seem i have a flaw in my code preventing the stop list from working correctly. Should this be a new question post? not sure how things work around here just yet! – Alex Mar 30 '11 at 14:29
  • To improve performance, consider ```stops = set(stopwords.words("english"))``` instead. – isakkarlsson Sep 07 '13 at 22:04
  • `>>> import nltk` then `>>> nltk.download()` [Source](http://www.nltk.org/data.html) – Dec 14 '17 at 20:33
  • `stopwords.words('english')` are lower case. So make sure to use only lower-case words in the list, e.g. `[w.lower() for w in word_list]` – Alex Aug 24 '18 at 18:10
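Putting the two comments above together, a hedged sketch: build the stopword set once (so each membership test is O(1) instead of a scan of the list) and lowercase each word before comparing. The small `stop_set` stands in for `set(stopwords.words('english'))` so the example runs without NLTK downloads.

```python
stop_set = {"the", "a", "in", "is"}  # stand-in for set(stopwords.words('english'))

word_list = ["The", "cat", "is", "in", "the", "hat"]
# lowercase before testing, since the NLTK stopwords are all lower case
filtered_words = [w for w in word_list if w.lower() not in stop_set]
print(filtered_words)
```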

To exclude all types of stop words, including the NLTK stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))        # about 900 stopwords
nltk_words = list(stopwords.words('english'))  # about 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
sumitjainjr

You could also do a set diff, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
David Lemphers
  • Note: this converts the sentence to a SET which removes all the duplicate words and therefore you will not be able to use frequency counting on the result – David Dehghan Feb 21 '17 at 23:59
  • converting to a set might remove viable information from the sentence by scraping multiple occurrences of an important word. – Ujjwal Nov 28 '19 at 03:57
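If you need word order and duplicates preserved (e.g. for frequency counting), a hedged alternative to the set difference above is to use a set only for the membership test. `stop_set` is a small stand-in for `set(nltk.corpus.stopwords.words('english'))`:

```python
stop_set = {"the", "in", "and"}  # stand-in for the NLTK stopword set

tokens = ["blue", "car", "and", "blue", "window"]
filtered = [t for t in tokens if t not in stop_set]
print(filtered)  # both occurrences of "blue" survive
```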

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over the original list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove stopwords from the copy
das_weezul

There's a very simple, lightweight Python package, stop-words, made just for this purpose.

First install the package using: pip install stop-words

Then you can remove your words in one line using list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages, such as:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian
user_3pij

Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # delete stopwords from text
justadev

Use the textcleaner library to remove stopwords from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to do so with this library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the above code to remove the stop words.


Although the question is a bit old, here is a newer library worth mentioning that can do extra tasks.

In some cases, you don't only want to remove stop words. Rather, you may want to find the stopwords in the text data and store them in a list, so that you can find the noise in the data and make it more interactive.

The library is called 'textfeatures'. You can use it as follows:

! pip install textfeatures
import textfeatures as tf
import pandas as pd

For example, suppose you have the following set of strings:

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]

df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df

Now, call the stopwords() function and pass the parameters you want:

tf.stopwords(df,"text","stopwords") # extract stop words
df[["text","stopwords"]].head() # show both columns

The result is going to be:

    text                                 stopwords
0   blue car and blue window             [and]
1   black crow in the window             [in, the]
2   i see my reflection in the window    [i, my, in, the]

As you can see, the last column has the stop words included in that document (record).
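A hedged, pure-Python sketch of the same idea, collecting (rather than removing) the stopwords each document contains; `stop_set` is a small stand-in for a real stopword list such as `stopwords.words('english')`:

```python
stop_set = {"and", "in", "the", "i", "my"}  # stand-in stopword list

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window",
]
# per document, keep only the tokens that ARE stopwords
found = [[w for w in t.split() if w in stop_set] for t in texts]
print(found)
```

The result matches the `stopwords` column in the table above.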

Taie
  • probably should not use alias tf, as this makes it look like a new TensorFlow feature for many of us :-) – swygerts Feb 20 '23 at 21:38

You can use this function; note that you need to lowercase all the words:

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    stops = set(stopwords.words("english"))  # build the set once for fast lookups
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lower-cased
        if word not in stops:
            processed_word_list.append(word)
    return processed_word_list
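A self-contained variant of the same function, hedged so it runs without NLTK downloads: the stopword set is built once and passed in as a parameter, and the small default set stands in for `set(stopwords.words("english"))`.

```python
def remove_stopwords(word_list, stop_set=frozenset({"the", "is", "a"})):
    # stop_set is a stand-in for set(stopwords.words("english"))
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lower-cased
        if word not in stop_set:
            processed_word_list.append(word)
    return processed_word_list

print(remove_stopwords(["The", "sky", "is", "blue"]))
```

Passing the set in also makes the function easy to reuse with a different stopword list.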

Using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
Saeid BK
  • if `word_list` is large this code is very slow. It is better to convert the stopwords list to a set before using it: `.. in set(stopwords.words('english'))`. – Robert Sep 23 '19 at 08:43
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# as a list comprehension:
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# or, equivalently, as an explicit loop:
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
H M

Here is an example. First, I extract the text data from the data frame (twitter_df) for further processing:

     tweetText = twitter_df['text']

Then, to tokenize it, I use the following method:

     from nltk.tokenize import word_tokenize
     tweetText = tweetText.apply(word_tokenize)

Then, to remove stop words,

     import nltk
     from nltk.corpus import stopwords
     nltk.download('stopwords')

     stop_words = set(stopwords.words('english'))
     tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
     tweetText.head()

I think this will help you.

user_3pij

In case your data is stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopword list by default.

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
Jonathan Besomi