
I'm using CountVectorizer to tokenize text, and I want to add my own stop words. Why doesn't this work? The word 'de' shouldn't appear in the final print.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1,1),stop_words=frozenset([u'de']))
word_tokenizer = vectorizer.build_tokenizer()
print (word_tokenizer(u'Isto é um teste de qualquer coisa.'))

[u'Isto', u'um', u'teste', u'de', u'qualquer', u'coisa']
Miguel
  • I've never used this library before, but the documentation says `stop_words` is supposed to be a list. Have you tried just `stop_words=[u'de']`? – Tagc Jan 17 '17 at 16:15
  • Is http://stackoverflow.com/questions/24386489/adding-words-to-scikit-learns-countvectorizers-stop-list#24386751 useful? – fredtantini Jan 17 '17 at 16:15
  • Yes @Tagc, that was my first try. But then saw this http://stackoverflow.com/questions/40124476/how-to-set-custom-stop-words-for-sklearn-countvectorizer – Miguel Jan 17 '17 at 16:17
  • However, it doesn't work either. – Miguel Jan 17 '17 at 16:18

1 Answer

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=frozenset([u'de']))
word_tokenizer = vectorizer.build_tokenizer()
# vocabulary_ is only populated once the vectorizer has been fitted
vectorizer.fit_transform([u'Isto é um teste de qualquer coisa.'])

In [7]: vectorizer.vocabulary_
Out[7]: {u'coisa': 0, u'isto': 1, u'qualquer': 2, u'teste': 3, u'um': 4}

You can see that u'de' is not in the computed vocabulary.
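As a quick sanity check (not in the original answer), `CountVectorizer.get_stop_words()` returns the stop list the vectorizer will actually apply, so you can confirm the custom set was registered even though the plain tokenizer ignores it; the output below is what I would expect:

In [8]: vectorizer.get_stop_words()
Out[8]: frozenset([u'de'])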

The method build_tokenizer just tokenizes your string; removing the stop words is supposed to be done afterwards.

From the source code of CountVectorizer:

def build_tokenizer(self):
    """Return a function that splits a string into a sequence of tokens"""
    if self.tokenizer is not None:
        return self.tokenizer
    token_pattern = re.compile(self.token_pattern)
    return lambda doc: token_pattern.findall(doc)
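If you want a callable that does apply the stop word filtering, `build_analyzer` (which chains preprocessing, tokenization, stop word removal and n-gram extraction) should work; a minimal sketch, with the output I would expect on this input:

analyzer = vectorizer.build_analyzer()
print (analyzer(u'Isto é um teste de qualquer coisa.'))
# expected: [u'isto', u'um', u'teste', u'qualquer', u'coisa']
# the analyzer also lowercases, and u'é' is dropped because the default
# token_pattern only keeps tokens of two or more characters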

A solution to your problem could be:

vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=frozenset([u'de']))
sentence = [u'Isto é um teste de qualquer coisa.']
# fit_transform runs the full analysis pipeline, including stop word removal
tokenized = vectorizer.fit_transform(sentence)
result = vectorizer.inverse_transform(tokenized)

In [12]: result
Out[12]: 
[array([u'isto', u'um', u'teste', u'qualquer', u'coisa'], 
       dtype='<U8')]
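Alternatively (my addition, not part of the original answer), if you only need the list of kept terms rather than per-document arrays, `get_feature_names()` on the fitted vectorizer lists the vocabulary directly, with u'de' already filtered out:

In [13]: vectorizer.get_feature_names()
Out[13]: [u'coisa', u'isto', u'qualquer', u'teste', u'um']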
arthur