I'm trying to run an LDA analysis with scikit-learn on a list of Danish reviews from Trustpilot, using the following code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction import text
x = pd.read_csv("output.csv")
my_stop_words = open('stopord.txt','r').read().split('\n',)
number_topics = 15
no_features = 1000
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words=my_stop_words)
tf = tf_vectorizer.fit_transform(x["review_body"].values.astype('U'))
tf_feature_names = tf_vectorizer.get_feature_names()
lda = LatentDirichletAllocation(
    n_components=number_topics,
    max_iter=5,
    learning_method='online',
    learning_offset=50.,
    random_state=0,
).fit(tf)
print("LDA topics computed.")
My stop words file is read and the code runs without errors, but I run into two problems:
1. My stop words are ignored in the final output.
2. I get the following message (in red):
C:\Users\[USERNAME]\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:388: UserWarning: Your stop_words may be inconsistent with your preprocessing.
Tokenizing the stop words generated tokens ['bl', 'bã', 'ca', 'dan', 'derpã', 'eks', 'fã', 'gã', 'herpã', 'hvornã', 'ledes', 'lã', 'mã', 'ngere', 'nã', 'ogsã', 'pga', 'pã', 're', 'rende', 'ret', 'rst', 'ske', 'ste', 'sten', 'sã', 'vrigt', 'vã'] not in stop_words.
warnings.warn('Your stop_words may be inconsistent with
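
One possible reading of the tokens in that warning: 'pã' and 'ogsã' are what comes out when a UTF-8 file containing 'på' and 'også' is decoded as cp1252 and then tokenized. The sketch below reproduces this under two unconfirmed assumptions: that stopord.txt is saved as UTF-8, and that the default encoding on this Windows machine is cp1252. It uses only the standard library together with the default token pattern of CountVectorizer, and the three sample words are hypothetical stand-ins for the contents of the file.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase and tokenize the way CountVectorizer does by default."""
    return TOKEN_PATTERN.findall(text.lower())


# Hypothetical sample of Danish stop words (stand-in for stopord.txt).
danish_stop_words = ["på", "også", "få"]

# Assumption: the file on disk is UTF-8 encoded.
raw_bytes = "\n".join(danish_stop_words).encode("utf-8")

# Assumption: open() without an encoding argument decodes as cp1252.
misread = raw_bytes.decode("cp1252").split("\n")
# Decoding with the encoding the file was written in.
correct = raw_bytes.decode("utf-8").split("\n")

misread_tokens = sorted({tok for word in misread for tok in tokenize(word)})
correct_tokens = sorted({tok for word in correct for tok in tokenize(word)})

print(misread_tokens)  # truncated tokens like those listed in the warning
print(correct_tokens)  # the original stop words, unchanged
```

If that is what is happening here, passing encoding='utf-8' to open() when reading stopord.txt would be the first thing to try, since the misread words can no longer match the words in the reviews.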