I'm trying to run an LDA analysis with scikit-learn on a list of Danish reviews from Trustpilot, using the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction import text
    
x = pd.read_csv("output.csv")                                # scraped Trustpilot reviews
my_stop_words = open('stopord.txt','r').read().split('\n')   # Danish stop words, one per line

number_topics = 15   # number of LDA topics

no_features = 1000   # vocabulary size for the vectorizer

# bag-of-words representation of the review texts
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words=my_stop_words)
tf = tf_vectorizer.fit_transform(x["review_body"].values.astype('U'))
tf_feature_names = tf_vectorizer.get_feature_names()
lda = LatentDirichletAllocation(n_components=number_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)
print("LDA topics computed.")

While my stop words file is read in and the script runs without errors, I run into two problems:

  1. My stop words are ignored in the final output.

  2. I get the following warning (in red):

C:\Users\[USERNAME]\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:388: UserWarning: Your stop_words may be inconsistent with your preprocessing. 
Tokenizing the stop words generated tokens ['bl', 'bã', 'ca', 'dan', 'derpã', 'eks', 'fã', 'gã', 'herpã', 'hvornã', 'ledes', 'lã', 'mã', 'ngere', 'nã', 'ogsã', 'pga', 'pã', 're', 'rende', 'ret', 'rst', 'ske', 'ste', 'sten', 'sã', 'vrigt', 'vã'] not in stop_words.
  warnings.warn('Your stop_words may be inconsistent with 
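
From what I understand, scikit-learn raises this warning when running its own preprocessing and tokenization over the stop word list produces tokens that are not themselves in the list. A minimal sketch of how that consistency check can be reproduced (reading stopord.txt as UTF-8 here is an assumption; the file's real encoding may differ and may be part of the problem):

from sklearn.feature_extraction.text import CountVectorizer

# Assumption: stopord.txt is UTF-8 encoded; the original code relies on the
# platform default encoding, which on Windows is usually cp1252.
with open('stopord.txt', 'r', encoding='utf-8') as f:
    stop_words = [w.strip() for w in f if w.strip()]

vec = CountVectorizer(stop_words=stop_words)
preprocess = vec.build_preprocessor()   # lowercasing, optional accent stripping
tokenize = vec.build_tokenizer()        # default pattern: words of 2+ word characters

# Tokens produced from the stop word list that are not themselves stop words;
# these are the tokens the UserWarning complains about.
inconsistent = set()
for word in stop_words:
    for token in tokenize(preprocess(word)):
        if token not in stop_words:
            inconsistent.add(token)
print(sorted(inconsistent))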
  • Maybe this helps - https://stackoverflow.com/questions/57340142/user-warning-your-stop-words-may-be-inconsistent-with-your-preprocessing – Mortz Oct 06 '21 at 07:43
  • @Mortz Thanks! I've been through a lot of posts but not that one yet. I don't know if it will work yet, but thanks for the help, it looks promising. – Kristoffer Larsen Oct 06 '21 at 10:32
