I'm new to the gensim package and vector space models in general, and I'm unsure of what exactly I should do with my LSA output.
To give a brief overview of my goal, I'd like to enhance Naive Bayes Classifier using topic modeling to improve classification of reviews (positive or negative). Here's a great paper I've been reading that has shaped my ideas but left me still somewhat confused about implementation..
I've already got working code for Naive Bayes--currently, I'm just using unigram bag of words as my features and labels are either positive or negative.
Here's my gensim code
from pprint import pprint # pretty printer
import gensim as gs
# tutorial sample documents
docs = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# stoplist removal, tokenization
stoplist = set('for a of the and to in'.split())
# for each document: lowercase document, split by whitespace, and add all its words not in stoplist to texts
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in docs]
# create dict
dict = gs.corpora.Dictionary(texts)
# create corpus
corpus = [dict.doc2bow(text) for text in texts]
# tf-idf
tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
# latent semantic indexing with 10 topics
lsi = gs.models.LsiModel(corpus_tfidf, id2word=dict, num_topics =10)
for i in lsi.print_topics():
print i
Here's output
0.400*"system" + 0.318*"survey" + 0.290*"user" + 0.274*"eps" + 0.236*"management" + 0.236*"opinion" + 0.235*"response" + 0.235*"time" + 0.224*"interface" + 0.224*"computer"
0.421*"minors" + 0.420*"graph" + 0.293*"survey" + 0.239*"trees" + 0.226*"paths" + 0.226*"intersection" + -0.204*"system" + -0.196*"eps" + 0.189*"widths" + 0.189*"quasi"
-0.318*"time" + -0.318*"response" + -0.261*"error" + -0.261*"measurement" + -0.261*"perceived" + -0.261*"relation" + 0.248*"eps" + -0.203*"opinion" + 0.195*"human" + 0.190*"testing"
0.416*"random" + 0.416*"binary" + 0.416*"generation" + 0.416*"unordered" + 0.256*"trees" + -0.225*"minors" + -0.177*"survey" + 0.161*"paths" + 0.161*"intersection" + 0.119*"error"
-0.398*"abc" + -0.398*"lab" + -0.398*"machine" + -0.398*"applications" + -0.301*"computer" + 0.242*"system" + 0.237*"eps" + 0.180*"testing" + 0.180*"engineering" + 0.166*"management"
Any suggestions or general comments would be appreciated.