Scikit-learn's `CountVectorizer` class lets you pass the string `'english'` to the `stop_words` argument. I want to add some words of my own to this predefined list. Can anyone tell me how to do this?
- Do you mean you want the default `'english'` `stop_words` plus some extras of your own? – jonrsharpe Jun 24 '14 at 12:24
- This post has been a life saver. – TheM00s3 Mar 14 '17 at 17:23
1 Answer
According to the source code for `sklearn.feature_extraction.text`, the full list (actually a `frozenset`, from `stop_words`) of `ENGLISH_STOP_WORDS` is exposed through `__all__`. Therefore if you want to use that list plus some more items, you could do something like:

    from sklearn.feature_extraction import text
    stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

(where `my_additional_stop_words` is any iterable of strings, such as a list; a bare string would be split into single characters) and use the result as the `stop_words` argument. This input to `CountVectorizer.__init__` is parsed by `_check_stop_list`, which will pass the new `frozenset` straight through.
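
For anyone who wants to see the whole thing end to end, here is a minimal sketch. The extra words and the two sample documents are made up for illustration. Converting the result with `list(...)` is my own assumption rather than part of the original answer: as far as I know, recent scikit-learn releases validate `stop_words` as a list, so the conversion should work on both older and newer versions.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical extras; replace with your own words (lowercase, one token each).
my_additional_stop_words = ["john", "tyler"]

# union() returns a new frozenset; ENGLISH_STOP_WORDS itself is unchanged.
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

# Made-up sample documents.
docs = ["John I hope you like it", "Tyler place is near by"]

# Baseline: the built-in list only, so "john" survives.
baseline = CountVectorizer(stop_words="english").fit(docs)

# Pass the combined collection itself, not the string 'english'.
# list(...) is an assumption to satisfy stricter parameter validation.
vectorizer = CountVectorizer(stop_words=list(stop_words)).fit(docs)

print("john" in baseline.vocabulary_)
print("john" in vectorizer.vocabulary_)
```

The point to take away is that the vectorizer only sees your additions if you hand it the new collection; `stop_words='english'` always means the built-in list.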

jonrsharpe
- It's interesting to note there are only 318 stopwords in the set. Maybe these pre-supplied stopwords need to be expanded by the person using it. – Monica Heddneck Jan 18 '16 at 08:39
- Works very well with `CountVectorizer(stop_words=text.ENGLISH_STOP_WORDS.union(array_example))` – Pablo Díaz Jun 02 '20 at 22:10
- I tried to use this code but it did not work for me. Here is a reproducible example: `my_text = ['John I hope you like it', "Tyler place is near by"]`, `stop_words = text.ENGLISH_STOP_WORDS.union("john")`, `count_vectorizer = CountVectorizer(stop_words='english')`, `vec = count_vectorizer.fit(my_text)`, `bag_of_words = vec.transform(my_text)`, `sum_words = bag_of_words.sum(axis=0)` (`sum_words` is a 1 x n_words matrix without labels), `words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]`, `sublst = sorted(words_freq, key=lambda x: x[1], reverse=True)`. `sublst` still has john. – seakyourpeak Mar 25 '22 at 18:14
- @seakyourpeak you're not actually _using_ your custom set of stop words... – jonrsharpe Mar 25 '22 at 18:16
- Being new to Python, I am not able to figure out how I can use it. I thought `CountVectorizer(stop_words='english')` means using the stop words that I have already augmented with my own list. Thanks in advance if you can show in code how to actually use it. – seakyourpeak Mar 25 '22 at 19:07
- @seakyourpeak note that the original stop words are a _frozenset_, which is immutable. This creates a _new set_, it doesn't change the old one. – jonrsharpe Mar 26 '22 at 09:40
- The updated link to stop_words is: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/feature_extraction/_stop_words.py (had to comment because edit queue was overflowing) – WiccanKarnak Feb 18 '23 at 15:14
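
On the reproducible example in the comments: besides never passing the new set to the vectorizer, it calls `.union("john")` with a bare string, which Python iterates character by character. A small sketch of both points in plain Python follows. `base_stop_words` is a shortened, hypothetical stand-in for `text.ENGLISH_STOP_WORDS`, so this runs without scikit-learn.

```python
# Shortened, hypothetical stand-in for text.ENGLISH_STOP_WORDS.
base_stop_words = frozenset({"the", "and", "is"})

# Pitfall: a bare string is iterated character by character,
# so this adds "j", "o", "h", "n" rather than the word "john".
wrong = base_stop_words.union("john")
print("john" in wrong)  # False
print("j" in wrong)     # True

# Wrap the word in a list so it is added whole.
right = base_stop_words.union(["john"])
print("john" in right)  # True

# frozenset is immutable: union() builds a new set and leaves
# the original untouched.
print("john" in base_stop_words)  # False
```

With the real list, the equivalent of `right` (or `list(right)`) is what goes into `CountVectorizer(stop_words=...)`.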