Is there a cheap and easy way to make sklearn's CountVectorizer apply the stop_words
parameter to bigrams as well as unigrams? The following snippet illustrates what I mean:
from sklearn.feature_extraction.text import CountVectorizer

texts = ['hello this is text number one yes yes',
         'hello this is text number two stackflow']
# The bigram I would like to exclude
stop_words = {'hello this'}
model = CountVectorizer(analyzer='word',
                        ngram_range=(1, 2),
                        max_features=3,
                        stop_words=stop_words)
doc_vectors = model.fit_transform(texts).toarray()
print(doc_vectors)
# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
print(model.get_feature_names())
This code outputs the following:
>>> [[1 1 1]
>>> [1 1 1]]
>>> ['hello', 'hello this', 'is']
As you can see, I wanted the bigram 'hello this' to be excluded, since it is in the stop words, but it still shows up as a feature. I've seen a few posts that use pipelines or custom analyzers, and I've browsed through the documentation, but isn't there an easier way to solve this?
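For reference, here is a minimal sketch of the custom-analyzer workaround mentioned above. It wraps the vectorizer's default word analyzer and drops unwanted n-grams after they are generated. The names stop_ngrams and analyzer_without_stop_ngrams are made up for illustration, and this is one possible approach rather than a built-in feature of CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['hello this is text number one yes yes',
         'hello this is text number two stackflow']

# Hypothetical name: the n-grams (of any length) to exclude.
stop_ngrams = {'hello this'}

# Reuse the default preprocessing, tokenization and n-gram generation.
base_analyzer = CountVectorizer(analyzer='word',
                                ngram_range=(1, 2)).build_analyzer()

def analyzer_without_stop_ngrams(doc):
    """Return the unigrams and bigrams of doc, minus those in stop_ngrams."""
    return [ngram for ngram in base_analyzer(doc) if ngram not in stop_ngrams]

# With a callable analyzer, ngram_range and stop_words on this
# vectorizer are ignored; the callable does all of that work.
model = CountVectorizer(analyzer=analyzer_without_stop_ngrams)
doc_vectors = model.fit_transform(texts).toarray()

# vocabulary_ maps each kept n-gram to its column index.
print(sorted(model.vocabulary_))
```

With this sketch, 'hello this' no longer appears in the vocabulary, while the unigrams 'hello' and 'this' and the other bigrams are kept. max_features can still be passed to the outer CountVectorizer if needed.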
Thanks!