66

What is the best way to add/remove stop words with spaCy? I am using the token.is_stop function and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding stop words. Thanks!

Davide Fiocco
E.K.

8 Answers

70

Using spaCy 2.0.11, you can update its stop-word set using one of the following:

To add a single stopword:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")

To add several stopwords at once:

import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1", "my_new_stopword2"}

To remove a single stopword:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")

To remove several stopwords at once:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}

Note: To see the current set of stopwords, use:

print(nlp.Defaults.stop_words)

Update: It was noted in the comments that this fix only affects the current execution. To persist changes to the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (described further at https://spacy.io/usage/saving-loading).
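
As a sketch of that save/load round trip (using a blank English pipeline and a temporary directory, so no trained model or download is assumed):

```python
import tempfile

import spacy

# Build a pipeline and customize its stop words.
nlp = spacy.blank("en")  # stand-in for spacy.load("en"); no download needed
nlp.Defaults.stop_words.add("my_new_stopword")

# Persist the pipeline to disk, then load it back later.
model_dir = tempfile.mkdtemp()  # hypothetical target path
nlp.to_disk(model_dir)
nlp_reloaded = spacy.blank("en").from_disk(model_dir)

print("my_new_stopword" in nlp.Defaults.stop_words)  # True
```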

Romain
  • @AustinT It is syntactic sugar to obtain the union of two sets, `a|=b` being equivalent to `a=a.union(b)`. Similarly, the `-=` operator performs a set difference. The curly-bracket syntax creates sets concisely, `a={1,2,3}` being equivalent to `a=set([1,2,3])`. – Romain Oct 07 '18 at 19:57
  • This doesn't actually affect the model. – fny Dec 05 '19 at 14:18
  • I mean that it actually doesn't seem to affect the current execution either. (Maybe I'm running something out of order.) The other method seems foolproof. – fny Dec 07 '19 at 19:11
  • I concur with @fny. While this adds the stop words to nlp.Defaults.stop_words, if you check that word with token.is_stop, you still get False. – Toby Jun 11 '20 at 07:10
  • Like others, I've found that this approach does not update `is_stop`, e.g. nlp.Defaults.stop_words.add('foo'); nlp.vocab['foo'].is_stop returns False. – Peter Aug 04 '22 at 20:04
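
The set operators discussed in the comments behave the same on any Python set, which is easy to check without spaCy (hypothetical words, plain built-in sets):

```python
# |= is in-place union: adds every element of the right-hand set.
stops = {"whatever", "whenever"}
stops |= {"my_new_stopword1", "my_new_stopword2"}

# -= is in-place difference: removes the given elements if present.
stops -= {"whatever", "whenever"}

print(sorted(stops))  # ['my_new_stopword1', 'my_new_stopword2']
```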
53

You can edit them before processing your text like this (see this post):

>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True

Note: This seems to work <=v1.8. For newer versions, see other answers.

dantiston
20

Short answer for version 2.0 and above (just tested with 3.4+):

from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS) # <- set of spaCy's default stop words

STOP_WORDS.add("your_additional_stop_word_here")
  • This loads all stop words as a set.
  • You can add your stop words to STOP_WORDS or use your own list in the first place.

To check whether the attribute is_stop is set to True for the stop words, use this:

# assumes an nlp object, e.g. nlp = spacy.load("en_core_web_sm")
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    print(lexeme.text, lexeme.is_stop)

In the unlikely case that the stop words aren't set to is_stop = True for some reason, do this:

for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True 
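
To see the effect end to end, here is a sketch using a blank English pipeline (tokenizer only, so no model download is needed; "foo" is a made-up word): add it to STOP_WORDS, set the lexeme flag, and filter it out of a tokenized text.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.blank("en")  # tokenizer-only pipeline

STOP_WORDS.add("foo")            # extend the default set
nlp.vocab["foo"].is_stop = True  # and set the lexeme flag explicitly

doc = nlp("foo is a word")
content = [token.text for token in doc if not token.is_stop]
print(content)
```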

A detailed step-by-step explanation with links to the documentation:

First we import spacy:

import spacy

To instantiate the Language class as nlp from scratch, we need to import Vocab and Language. Documentation and an example are here.

from spacy.vocab import Vocab
from spacy.language import Language

# create new Language object from scratch
nlp = Language(Vocab())

stop_words is a default attribute of the Language class and can be set to customize the default language data. Documentation here. You can find the folder in spaCy's GitHub repo with defaults for various languages here.

For our instance of nlp we get 0 stop words, which is reasonable since we haven't set any language defaults.

print(f"Language instance 'nlp' has {len(nlp.Defaults.stop_words)} default stopwords.")
>>> Language instance 'nlp' has 0 default stopwords.

Let's import English language defaults.

from spacy.lang.en import English

Now we have 326 default stop words.

print(f"The language default English has {len(spacy.lang.en.STOP_WORDS)} stopwords.")
print(sorted(list(spacy.lang.en.STOP_WORDS))[:10])
>>> The language default English has 326 stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']

Let's create a new instance of Language, now with defaults for English. We get the same result.

nlp = English()
print(f"Language instance 'nlp' now has {len(nlp.Defaults.stop_words)} default stopwords.")
print(sorted(list(nlp.Defaults.stop_words))[:10])
>>> Language instance 'nlp' now has 326 default stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']

To check that all words are set to is_stop = True, we iterate over the stop words, retrieve the lexeme from the vocab, and print the is_stop attribute.

[nlp.vocab[word].is_stop for word in nlp.Defaults.stop_words][:10]
>>> [True, True, True, True, True, True, True, True, True, True]

We can add stopwords to the English language defaults.

spacy.lang.en.STOP_WORDS.add("aaaahhh-new-stopword")
print(len(spacy.lang.en.STOP_WORDS))
# these propagate to our instance 'nlp' too! 
print(len(nlp.Defaults.stop_words))
>>> 327
>>> 327

Or we can add new stopwords to instance nlp. However, these propagate to our language defaults too!

nlp.Defaults.stop_words.add("_another-new-stop-word")
print(len(spacy.lang.en.STOP_WORDS))
print(len(nlp.Defaults.stop_words))
>>> 328
>>> 328

The new stop words are set to is_stop = True.

print(nlp.vocab["aaaahhh-new-stopword"].is_stop)
print(nlp.vocab["_another-new-stop-word"].is_stop)
>>> True
>>> True
petezurich
5

For 2.0 use the following:

for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True
Community
harryhorn
  • You are showing how to fix a broken model as per [this bug/workaround](https://archive.is/HI5ZQ#selection-1231.0-1263.4). Whilst it is easy to adapt this for the OP's needs, you could have expanded on why you are writing the code this way: it is currently required because of the bug, but it's an otherwise redundant step, as `lex.is_stop` should already be `True` in the bug-free future. – lucid_dreamer May 18 '18 at 07:28
4

This collects the stop words too :)

import spacy.lang.en.stop_words

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

Davide Fiocco
SolitaryReaper
0

In the latest version, the following removes the word from the list:

import spacy.lang.en.stop_words

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')
Sezin
0

For version 2.3.0: if you want to replace the entire list instead of adding or removing a few stop words, you can do this:

custom_stop_words = set(['the','and','a'])

# First override the stop words set for the language
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words

# Now load your model
nlp = spacy.load('en_core_web_md')

The trick is to assign the stop-word set for the language before loading the model. It also ensures that any upper/lowercase variations of the stop words are treated as stop words.
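
A quick way to sanity-check this pattern (sketched with spacy.blank instead of en_core_web_md, so no model download is needed):

```python
import spacy

custom_stop_words = {'the', 'and', 'a'}

# Override the stop-word set on the language class first...
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words

# ...then create the pipeline (stand-in for spacy.load('en_core_web_md')).
nlp = spacy.blank('en')
print(nlp.Defaults.stop_words == custom_stop_words)  # True
```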

Joe
0

See the piece of code below:

# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)

len(nlp.Defaults.stop_words)

# Make a list of the words you want to add as stop words
# (named `words` to avoid shadowing the built-in `list`)
words = ['apple', 'ball', 'cat']

# Iterate over them in a loop
for item in words:
    # Add the word to the set of stop words. Use lowercase!
    nlp.Defaults.stop_words.add(item)

    # Set the stop_word flag on the lexeme
    nlp.vocab[item].is_stop = True

Hope this helps. You can print the length before and after to confirm.

servolt