
I'm using CountVectorizer to tokenize text, and I want to add my own stop words. Why doesn't this work? The word 'de' shouldn't appear in the final print.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1,1),stop_words=frozenset([u'de']))
word_tokenizer = vectorizer.build_tokenizer()
print (word_tokenizer(u'Isto é um teste de qualquer coisa.'))

[u'Isto', u'um', u'teste', u'de', u'qualquer', u'coisa']
Miguel
  • I've never used this library before, but the documentation says `stop_words` is supposed to be a list. Have you tried just `stop_words=[u'de']`? – Tagc Jan 17 '17 at 16:15
  • Is http://stackoverflow.com/questions/24386489/adding-words-to-scikit-learns-countvectorizers-stop-list#24386751 useful? – fredtantini Jan 17 '17 at 16:15
  • Yes @Tagc, that was my first try. But then saw this http://stackoverflow.com/questions/40124476/how-to-set-custom-stop-words-for-sklearn-countvectorizer – Miguel Jan 17 '17 at 16:17
  • However, it doesn't work either. – Miguel Jan 17 '17 at 16:18

1 Answer

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=frozenset([u'de']))
word_tokenizer = vectorizer.build_tokenizer()
# vocabulary_ is only populated once the vectorizer has been fitted
vectorizer.fit_transform([u'Isto é um teste de qualquer coisa.'])

In [7]: vectorizer.vocabulary_
Out[7]: {u'coisa': 0, u'isto': 1, u'qualquer': 2, u'teste': 3, u'um': 4}

You can see that u'de' is not in the computed vocabulary.
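As a quick sanity check (not in the original answer), `CountVectorizer.get_stop_words()` returns the stop list the vectorizer will actually apply, so you can confirm the custom set was registered even though the plain tokenizer ignores it; the output below is what I would expect:

In [8]: vectorizer.get_stop_words()
Out[8]: frozenset([u'de'])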

The method build_tokenizer just tokenizes your string; removing the stop words is supposed to be done afterwards.

From the source code of CountVectorizer:

def build_tokenizer(self):
    """Return a function that splits a string into a sequence of tokens"""
    if self.tokenizer is not None:
        return self.tokenizer
    token_pattern = re.compile(self.token_pattern)
    return lambda doc: token_pattern.findall(doc)
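If you want a callable that does apply the stop word filtering, `build_analyzer` (which chains preprocessing, tokenization, stop word removal and n-gram extraction) should work; a minimal sketch, with the output I would expect on this input:

analyzer = vectorizer.build_analyzer()
print (analyzer(u'Isto é um teste de qualquer coisa.'))
# expected: [u'isto', u'um', u'teste', u'qualquer', u'coisa']
# the analyzer also lowercases, and u'é' is dropped because the default
# token_pattern only keeps tokens of two or more characters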

A solution to your problem could be:

vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=frozenset([u'de']))
sentence = [u'Isto é um teste de qualquer coisa.']
# fit_transform runs the full analysis pipeline, including stop word removal
tokenized = vectorizer.fit_transform(sentence)
result = vectorizer.inverse_transform(tokenized)

In [12]: result
Out[12]: 
[array([u'isto', u'um', u'teste', u'qualquer', u'coisa'], 
       dtype='<U8')]
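Alternatively (my addition, not part of the original answer), if you only need the list of kept terms rather than per-document arrays, `get_feature_names()` on the fitted vectorizer lists the vocabulary directly, with u'de' already filtered out:

In [13]: vectorizer.get_feature_names()
Out[13]: [u'coisa', u'isto', u'qualquer', u'teste', u'um']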
arthur