What is the best way to add/remove stop words with spaCy? I am using the `token.is_stop` attribute and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding stop words. Thanks!

- The complete list: `from spacy.en.word_sets import STOP_WORDS` – Xeoncross Sep 06 '17 at 02:55
8 Answers
Using Spacy 2.0.11, you can update its stopwords set using one of the following:
To add a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")
To add several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}
To remove a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")
To remove several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}
Note: To see the current set of stopwords, use:
print(nlp.Defaults.stop_words)
Update: It was noted in the comments that this fix only affects the current execution. To update the model, you can use the methods `nlp.to_disk("/path")` and `nlp.from_disk("/path")` (further described at https://spacy.io/usage/saving-loading).
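A minimal sketch of that approach, assuming spaCy 2.x (the path ./custom_model is just an example):
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")
nlp.vocab["my_new_stopword"].is_stop = True   # also flag the lexeme, as discussed in the comments below
nlp.to_disk("./custom_model")                 # persist the modified pipeline
# later, or in another session, load it back:
nlp = spacy.load("./custom_model")
print(nlp.vocab["my_new_stopword"].is_stop)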

- @AustinT It is syntactic sugar to obtain the union of two sets, `a|=b` being equivalent to `a=a.union(b)`. Similarly, the operator `-=` performs a set difference. The curly bracket syntax allows you to create sets in a simple way, `a={1,2,3}` being equivalent to `a=set([1,2,3])`. – Romain Oct 07 '18 at 19:57
- I mean that it actually doesn't seem to affect the current execution either. (Maybe I'm running something out of order.) The other method seems foolproof. – fny Dec 07 '19 at 19:11
- I concur with @fny. While this adds the stop word to `nlp.Defaults.stop_words`, if you check that word with `token.is_stop`, you still get `False`. – Toby Jun 11 '20 at 07:10
- Like others, I've found that this approach does not update `is_stop`, e.g. `nlp.Defaults.stop_words.add('foo')`; `nlp.vocab['foo'].is_stop` returns `False`. – Peter Aug 04 '22 at 20:04
You can edit them before processing your text like this (see this post):
>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True
Note: This seems to work <=v1.8. For newer versions, see other answers.

- This solution does not seem to be working anymore with version 1.9.0? I am getting `TypeError: an integer is required` – E.K. Sep 12 '17 at 20:31
- @E.K. The reason for the error is that the vocab input word should be unicode (use `u"the"` instead of `"the"`). – Eb Abadi Jan 18 '18 at 19:23
Short answer for version 2.0 and above (just tested with 3.4+):
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS) # <- set of Spacy's default stop words
STOP_WORDS.add("your_additional_stop_word_here")
- This loads all stop words as a set.
- You can add your stop words to `STOP_WORDS` or use your own list in the first place.
To check if the attribute `is_stop` for the stop words is set to `True`, use this:
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    print(lexeme.text, lexeme.is_stop)
In the unlikely case that stop words for some reason aren't set to `is_stop = True`, do this:
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True
Detailed explanation step by step with links to documentation.
First we import spacy:
import spacy
To instantiate class `Language` as `nlp` from scratch we need to import `Vocab` and `Language`. Documentation and example here.
from spacy.vocab import Vocab
from spacy.language import Language
# create new Language object from scratch
nlp = Language(Vocab())
`stop_words` is a default attribute of class `Language` and can be set to customize the default language data. Documentation here. You can find spaCy's GitHub repo folder with defaults for various languages here.
For our instance of `nlp` we get 0 stop words, which is reasonable since we haven't set any language with defaults.
print(f"Language instance 'nlp' has {len(nlp.Defaults.stop_words)} default stopwords.")
>>> Language instance 'nlp' has 0 default stopwords.
Let's import English language defaults.
from spacy.lang.en import English
Now we have 326 default stop words.
print(f"The language default English has {len(spacy.lang.en.STOP_WORDS)} stopwords.")
print(sorted(list(spacy.lang.en.STOP_WORDS))[:10])
>>> The language default English has 326 stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']
Let's create a new instance of `Language`, now with defaults for English. We get the same result.
nlp = English()
print(f"Language instance 'nlp' now has {len(nlp.Defaults.stop_words)} default stopwords.")
print(sorted(list(nlp.Defaults.stop_words))[:10])
>>> Language instance 'nlp' now has 326 default stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']
To check if all words are set to `is_stop = True` we iterate over the stop words, retrieve the lexeme from `vocab`, and print out the `is_stop` attribute.
[nlp.vocab[word].is_stop for word in nlp.Defaults.stop_words][:10]
>>> [True, True, True, True, True, True, True, True, True, True]
We can add stopwords to the English language defaults.
spacy.lang.en.STOP_WORDS.add("aaaahhh-new-stopword")
print(len(spacy.lang.en.STOP_WORDS))
# these propagate to our instance 'nlp' too!
print(len(nlp.Defaults.stop_words))
>>> 327
>>> 327
Or we can add new stopwords to instance `nlp`. However, these propagate to our language defaults too!
nlp.Defaults.stop_words.add("_another-new-stop-word")
print(len(spacy.lang.en.STOP_WORDS))
print(len(nlp.Defaults.stop_words))
>>> 328
>>> 328
The new stop words are set to `is_stop = True`.
print(nlp.vocab["aaaahhh-new-stopword"].is_stop)
print(nlp.vocab["_another-new-stop-word"].is_stop)
>>> True
>>> True

- Did that with version 2.0 and got `ImportError: No module named en.stop_words` ... suggestions? – user1025852 Nov 22 '17 at 22:19
- @user1025852 Unfortunately I cannot replicate your error. My code still works fine (now even using spacy 3.4.x). – petezurich Dec 23 '22 at 15:32
For 2.0 use the following:
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True
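To adapt this for adding your own stop word, a small sketch ("my_new_stopword" is just a placeholder) could be:
nlp.Defaults.stop_words.add("my_new_stopword")   # extend the default set
nlp.vocab["my_new_stopword"].is_stop = True      # flag the lexeme so token.is_stop picks it up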
- You are showing how to fix a broken model as per [this bug/workaround](https://archive.is/HI5ZQ#selection-1231.0-1263.4). Whilst it is easy to adapt this for the OP's needs, you could have expanded on why you are writing the code this way: it is currently required because of the bug, but it's an otherwise redundant step, as `lex.is_stop` should already be `True` in the bug-free future. – lucid_dreamer May 18 '18 at 07:28
This collects the stop words too :)
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

In the latest version, the following would remove the word from the list:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')
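As the comments on the first answer point out, mutating the set alone may not flip `token.is_stop` on a loaded pipeline. A sketch that also clears the lexeme flag (assuming `en_core_web_sm` is installed) could look like this:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load("en_core_web_sm")
STOP_WORDS.discard("not")          # remove from the default set (no error if already absent)
nlp.vocab["not"].is_stop = False   # also clear the flag on the cached lexeme
doc = nlp("this is not a stop word anymore")
print([(token.text, token.is_stop) for token in doc])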

For version 2.3.0: if you want to replace the entire list instead of adding or removing a few stop words, you can do this:
custom_stop_words = set(['the','and','a'])
# First override the stop words set for the language
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words
# Now load your model
nlp = spacy.load('en_core_web_md')
The trick is to assign the stop word set for the language before loading the model. It also ensures that any upper/lower-case variations of the stop words are considered stop words.
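A quick sanity check, assuming the model above is loaded and using an arbitrary example sentence, might be:
doc = nlp("The cat and a dog")
print([(token.text, token.is_stop) for token in doc])
# 'The', 'and', 'a' should be flagged as stop words ('The' via the case handling
# mentioned above), while 'cat' and 'dog' should not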

See the piece of code below:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)
len(nlp.Defaults.stop_words)
# Make a list of the words you want to add to the stop words
custom_words = ['apple', 'ball', 'cat']
# Iterate over them in a loop
for item in custom_words:
    # Add the word to the set of stop words. Use lowercase!
    nlp.Defaults.stop_words.add(item)
    # Set the stop_word tag on the lexeme
    nlp.vocab[item].is_stop = True
Hope this helps. You can print the length before and after to confirm.
