
I want to add a few more words to the stop_words list used by TfidfVectorizer. I followed the solution in Adding words to scikit-learn's CountVectorizer's stop list. My stop word list now contains both the 'english' stop words and the stop words I specified, but TfidfVectorizer still does not accept my list of stop words and I can still see those words in my features list. Below is my code:

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)

vectorizer = TfidfVectorizer(analyzer=u'word', max_df=0.95, lowercase=True,
                             stop_words=set(my_stop_words), max_features=15000)
X = vectorizer.fit_transform(text)

I have also tried setting stop_words in TfidfVectorizer as stop_words=my_stop_words, but it still does not work. Please help.

ac11
  • I used your code and ran it as [here](https://gist.github.com/anonymous/043a0099b4c388d0686d). I got the expected result. Can you provide more details? – Gurupad Hegde Nov 09 '14 at 23:13
  • I am classifying tweets which contain URLs. The features I extract using SelectKBest contain those URLs in pieces, so I thought of adding those URLs to my stop word list so that they get removed from my feature set. I added them as shown above. – ac11 Nov 11 '14 at 22:54
  • Here is how my stop word list looks like : frozenset(['', 'wA4qNj2o0b', 'all', 'fai5w3nBgo', 'Ikq7p9ElUW', '9W6GbM0MjL', 'four', 'WkOI43bsVj', 'x88VDFBzkO', 'whose', 'YqoLBzajjo', 'NVXydiHKSC', 'HdjXav51vI', 'q0YoiC0QCD', 'to', 'cTIYpRLarr', 'nABIG7dAlr', 'under', '6JF33FZIYU', 'very', 'AVFWjAWsbF']) – ac11 Nov 11 '14 at 22:55
  • And here is how my feature set looks like : [u'bcvjby2owk', u'cases bcvjby2owk', u'cases dgvsrqaw7p', u'dgvsrqaw7p', u'8dsto3yxi2', u'guardianafrica', u'guardianafrica guardian\xe2', u'guardianafrica guardian\xe2 nickswicks'] – ac11 Nov 11 '14 at 22:55
  • I can see that none of the stop words appear in the feature list, so the reported behaviour is expected. The method used to filter these hashes is wrong: if you pass random strings to the vectorizer as stop words, it won't intelligently filter similar strings. Stop words are exact, hard-coded strings to be filtered. Alternatively, you can use a regex (before passing the text block to the vectorizer) to filter out all the unwanted URLs. That may solve your problem with URLs. – Gurupad Hegde Nov 12 '14 at 11:48
  • I think my example was a bit confusing, sorry about that. I have hardcoded each and every string in my_stop_words, yet these strings pop up in the feature list, just in lowercase because I have set lowercase=True in TfidfVectorizer. – ac11 Nov 14 '14 at 06:52
  • I think I found the problem. It's the lowercase=True parameter. All the strings in the feature list are converted to lowercase, but the strings in my_word_list are still case sensitive, so they were not removed from the feature list even when the same words were present in my_word_list. Thanks for your help though. – ac11 Nov 14 '14 at 14:56
  • @ac11 It didn't work for me. What version of sklearn are you using? – Radu Gheorghiu Aug 22 '15 at 13:36
  • Hey... this was a course project I did in November last year. I even uninstalled sklearn. I don't know how else I can check that version. Sorry. – ac11 Aug 23 '15 at 03:23
  • Possible duplicate of [Adding words to scikit-learn's CountVectorizer's stop list](http://stackoverflow.com/questions/24386489/adding-words-to-scikit-learns-countvectorizers-stop-list) – Vivek Kumar Apr 25 '17 at 10:12

3 Answers


This is how you can do it:

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

my_stop_words = text.ENGLISH_STOP_WORDS.union(["book"])

vectorizer = TfidfVectorizer(ngram_range=(1,1), stop_words=my_stop_words)

X = vectorizer.fit_transform(["this is a green apple.", "this is a machine learning book."])

# map each feature name to its idf value (get_feature_names() was
# renamed get_feature_names_out() in scikit-learn 1.0+)
idf_values = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

# printing the tfidf vectors
print(X)

# printing the vocabulary
print(vectorizer.vocabulary_)

In this example, I created the tfidf vectors for two sample documents:

"This is a green apple."
"This is a machine learning book."

By default, this, is, a, and an are all in the ENGLISH_STOP_WORDS list, and I also added book to the stop word list. This is the output:

(0, 1)  0.707106781187
(0, 0)  0.707106781187
(1, 3)  0.707106781187
(1, 2)  0.707106781187
{'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}

As we can see, the word book is also removed from the list of features because we listed it as a stop word. In other words, TfidfVectorizer did accept the manually added word as a stop word and ignored it when creating the vectors.

Pedram
  • Is there a way to remove stop words from ENGLISH_STOP_WORDS instead of adding them, e.g. remove 'not'? – Stamatis Tiniakos Jul 03 '19 at 14:00
  • @StamatisTiniakos There should be. ENGLISH_STOP_WORDS is of type `frozenset`, so as an example, you can use this set to create a new list, add or remove words from the list, and then pass it to your vectorizer. – Pedram Feb 20 '20 at 23:37
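A minimal sketch of what that comment suggests: since ENGLISH_STOP_WORDS is a plain frozenset, its standard difference() method can build a reduced stop list with no sklearn-specific API.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Remove 'not' from the default English stop words so negations survive.
my_stop_words = text.ENGLISH_STOP_WORDS.difference(["not"])

vectorizer = TfidfVectorizer(stop_words=list(my_stop_words))
X = vectorizer.fit_transform(["this movie is not good"])

# 'not' is kept as a feature; 'this' and 'is' are still filtered out.
print(vectorizer.vocabulary_)
```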

This is answered here: https://stackoverflow.com/a/24386751/732396

Even though sklearn.feature_extraction.text.ENGLISH_STOP_WORDS is a frozenset, you can make a copy of it, add your own words, and then pass that variable to the stop_words argument as a list.
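A minimal sketch of that approach ("word1" and "word2" here are placeholder extra stop words):

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# ENGLISH_STOP_WORDS is a frozenset; union() returns a new frozenset,
# which we convert to a list before handing it to the vectorizer.
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["word1", "word2"]))

vectorizer = TfidfVectorizer(stop_words=my_stop_words)
X = vectorizer.fit_transform(["word1 appears here with word2"])

# Only 'appears' survives: the custom words and defaults are filtered.
print(vectorizer.vocabulary_)
```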

yanhan

For use with scikit-learn you can always use a list as well:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop = list(stopwords.words('english'))
stop.extend('myword1 myword2 myword3'.split())

vectorizer = TfidfVectorizer(analyzer='word', stop_words=set(stop))
vectors = vectorizer.fit_transform(corpus)
...

The only downside of this method over a set is that your list may end up containing duplicates, which is why I convert it back to a set when using it as an argument for TfidfVectorizer.

user2589273