I want to prevent certain phrases for creeping into my models. For example, I want to prevent 'red roses' from entering into my analysis. I understand how to add individual stop words as given in Adding words to scikit-learn's CountVectorizer's stop list by doing so:
from sklearn.feature_extraction import text
additional_stop_words=['red','roses']
However, this also results in other ngrams like 'red tulips' or 'blue roses' not being detected.
I am building a TfidfVectorizer as part of my model, and I realize the processing I need might have to be entered after this stage but I am not sure how to do this.
My eventual aim is to do topic modelling on a piece of text. Here is the piece of code (borrowed almost directly from https://de.dariah.eu/tatom/topic_model_python.html#index-0 ) that I am working on:
from sklearn import decomposition
from sklearn.feature_extraction import text
additional_stop_words = ['red', 'roses']
sw = text.ENGLISH_STOP_WORDS.union(additional_stop_words)
mod_vectorizer = text.TfidfVectorizer(
ngram_range=(2,3),
stop_words=sw,
norm='l2',
min_df=5
)
dtm = mod_vectorizer.fit_transform(df[col]).toarray()
vocab = np.array(mod_vectorizer.get_feature_names())
num_topics = 5
num_top_words = 5
m_clf = decomposition.LatentDirichletAllocation(
n_topics=num_topics,
random_state=1
)
doctopic = m_clf.fit_transform(dtm)
topic_words = []
for topic in m_clf.components_:
word_idx = np.argsort(topic)[::-1][0:num_top_words]
topic_words.append([vocab[i] for i in word_idx])
doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
for t in range(len(topic_words)):
print("Topic {}: {}".format(t, ','.join(topic_words[t][:5])))
EDIT
Sample dataframe (I have tried to insert as many edge cases as possible), df:
Content
0 I like red roses as much as I like blue tulips.
1 It would be quite unusual to see red tulips, but not RED ROSES
2 It is almost impossible to find blue roses
3 I like most red flowers, but roses are my favorite.
4 Could you buy me some red roses?
5 John loves the color red. Roses are Mary's favorite flowers.