Adding words to scikit-learn's CountVectorizer's stop list

Question

Scikit-learn's CountVectorizer class lets you pass a string 'english' to the argument stop_words. I want to add some things to this predefined list. Can anyone tell me how to do this?

Do you mean you want the default `'english'` `stop_words` plus some extras of your own? — jonrsharpe, Jun 24 '14 at 12:24

jonrsharpe · Accepted Answer · 2023-02-18T20:18:07.423

66

According to the source code for sklearn.feature_extraction.text, the full list (actually a frozenset, from stop_words) of ENGLISH_STOP_WORDS is exposed through __all__. Therefore if you want to use that list plus some more items, you could do something like:

from sklearn.feature_extraction import text

# my_additional_stop_words must be defined beforehand, e.g. as a list of strings
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

(where my_additional_stop_words is any iterable of strings, such as a list or a set; a single bare string would be split into its individual characters) and use the result as the stop_words argument. This input to CountVectorizer.__init__ is parsed by _check_stop_list, which will pass the new frozenset straight through.
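
For anyone unsure how the result is actually used, here is a minimal sketch. The sample texts and the extra words are made up for illustration. The list() conversion is an assumption on my part: some newer scikit-learn releases appear to validate stop_words as 'english', a list, or None, and a list should also be accepted by older releases.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical extra stop words; note this is a list, not a bare string.
my_additional_stop_words = ["john", "tyler"]

# union() returns a new frozenset; ENGLISH_STOP_WORDS itself is not modified.
# list() is an assumption about newer releases that may reject a frozenset.
stop_words = list(text.ENGLISH_STOP_WORDS.union(my_additional_stop_words))

my_text = ["John I hope you like it", "Tyler place is near by"]

# Pass the combined collection itself, not the string 'english'.
vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(my_text)

# The learned vocabulary should no longer contain the extra words.
print(sorted(vectorizer.vocabulary_))
```

The key point is that the combined collection has to be handed to the constructor; passing stop_words='english' uses only the built-in list.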

edited Feb 18 '23 at 20:18

answered Jun 24 '14 at 12:33

jonrsharpe


it's interesting to note there are only 318 stopwords in the set. Maybe these pre-supplied stopwords need to be expanded by the person using it. – Monica Heddneck Jan 18 '16 at 08:39
Works very well with CountVectorizer(stop_words = text.ENGLISH_STOP_WORDS.union(array_example)) – Pablo Díaz Jun 02 '20 at 22:10
I tried to use this code but it did not work for me. Here is a reproducible example:

my_text = ['John I hope you like it', "Tyler place is near by"]
stop_words = text.ENGLISH_STOP_WORDS.union("john")
count_vectorizer = CountVectorizer(stop_words = 'english')
vec = count_vectorizer.fit(my_text)
bag_of_words = vec.transform(my_text)
sum_words = bag_of_words.sum(axis=0)  # sum_words is a 1 x n_words matrix without labels
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
sublst = sorted(words_freq, key = lambda x: x[1], reverse=True)

sublst still has john – seakyourpeak Mar 25 '22 at 18:14
@seakyourpeak you're not actually _using_ your custom set of stop words... – jonrsharpe Mar 25 '22 at 18:16
Being new to python, I am not able to figure out how I can use it. I thought CountVectorizer(stop_words = 'english') means using the stop_words that I already have augmented with my own list. Thx in advance if you can show in code how to actually use it. – seakyourpeak Mar 25 '22 at 19:07
@seakyourpeak note that the original stop words are a _frozenset_, which is immutable. This creates a _new set_, it doesn't change the old one. – jonrsharpe Mar 26 '22 at 09:40
the updated link to stop_words is: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/feature_extraction/_stop_words.py (had to comment because edit queue was overflowing) – WiccanKarnak Feb 18 '23 at 15:14
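
To illustrate the two points raised in the comments above (union returns a new set, and a bare string is iterated character by character), here is a small sketch in plain Python. The base_stop_words set is a shortened, hypothetical stand-in for the real ENGLISH_STOP_WORDS, so it runs without scikit-learn.

```python
# Hypothetical, shortened stand-in for text.ENGLISH_STOP_WORDS.
base_stop_words = frozenset({"the", "and", "you"})

# Passing a bare string: union() iterates over its characters.
wrong = base_stop_words.union("john")

# Passing a list: the whole word is added.
right = base_stop_words.union(["john"])

print("john" in wrong)            # False
print("j" in wrong)               # True
print("john" in right)            # True
print("john" in base_stop_words)  # False, the original frozenset is unchanged
```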
