Is there an inexpensive and easy way to make sklearn's CountVectorizer apply the stop_words parameter to bigrams as well as unigrams? As far as I can tell, stop_words only filters single tokens. The following snippet illustrates what I mean:

from sklearn.feature_extraction.text import CountVectorizer

texts = ['hello this is text number one yes yes',
        'hello this is text number two stackflow']

stop_words = {'hello this'}

model = CountVectorizer(analyzer='word', 
                        ngram_range=(1,2), 
                        max_features=3,
                        stop_words=stop_words)

doc_vectors = model.fit_transform(texts).toarray()
print(doc_vectors)
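# On scikit-learn >= 1.0, use model.get_feature_names_out() instead
# (get_feature_names() was removed in 1.2).
print(model.get_feature_names())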

This code outputs the following:

>>> [[1 1 1]
>>>  [1 1 1]]
>>> ['hello', 'hello this', 'is']

As you can see, the bigram 'hello this' is still counted, even though I passed it in stop_words and wanted it excluded. I've seen a few posts that use pipelines or custom analyzers, and I've browsed through the documentation, but isn't there an easier way around this problem?

Thanks!

coyjedg
  • what is the output that you want ? – seralouk Nov 16 '17 at 23:34
  • I want the output to be, for example: `>>> [[1 1 1] >>> [1 1 1]] >>> ['hello', 'this is', 'is']` – coyjedg Nov 17 '17 at 07:23
  • This has been answered in [How to remove stop phrases/stop ngrams (multi-word strings) using pandas/sklearn?](https://stackoverflow.com/questions/45426215/how-to-remove-stop-phrases-stop-ngrams-multi-word-strings-using-pandas-sklearn). – sns Jul 03 '20 at 15:22
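
For reference, the custom-analyzer approach mentioned in the question can be sketched roughly as follows. This is only a minimal illustration, not necessarily what the linked answer does: it builds the default word analyzer with `build_analyzer()`, then drops any n-gram found in a stop set before counting. The names `stop_ngrams` and `analyzer_without_stop_ngrams` are hypothetical, chosen for this example. Which three features survive `max_features=3` depends on how ties in frequency are broken, so no expected output is shown.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['hello this is text number one yes yes',
         'hello this is text number two stackflow']

# Hypothetical stop set; entries may be unigrams or bigrams.
stop_ngrams = {'hello this'}

# Reuse the default preprocessing, tokenization and n-gram generation.
base_analyzer = CountVectorizer(analyzer='word',
                                ngram_range=(1, 2)).build_analyzer()


def analyzer_without_stop_ngrams(doc):
    """Return the unigrams and bigrams of doc, minus those in stop_ngrams."""
    return [ngram for ngram in base_analyzer(doc) if ngram not in stop_ngrams]


# With a callable analyzer, ngram_range and stop_words are ignored,
# so all filtering happens inside the callable above.
model = CountVectorizer(analyzer=analyzer_without_stop_ngrams,
                        max_features=3)

doc_vectors = model.fit_transform(texts).toarray()

# vocabulary_ maps each kept feature to its column index.
feature_names = sorted(model.vocabulary_, key=model.vocabulary_.get)

print(doc_vectors)
print(feature_names)
```

Because the filter runs after n-gram generation, the unigrams 'hello' and 'this' are still counted on their own; only the exact bigram 'hello this' is dropped.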
