combining LSA/LSI with Naive Bayes for document classification

Question

I'm new to the gensim package and vector space models in general, and I'm unsure of what exactly I should do with my LSA output.

To give a brief overview of my goal, I'd like to enhance Naive Bayes Classifier using topic modeling to improve classification of reviews (positive or negative). Here's a great paper I've been reading that has shaped my ideas but left me still somewhat confused about implementation..

I've already got working code for Naive Bayes--currently, I'm just using unigram bag of words as my features and labels are either positive or negative.

Here's my gensim code

from pprint import pprint # pretty printer
import gensim as gs

# tutorial sample documents
docs = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]


# stoplist removal, tokenization
stoplist = set('for a of the and to in'.split())
# for each document: lowercase document, split by whitespace, and add all its words not in stoplist to texts
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in docs]


# create dict
dict = gs.corpora.Dictionary(texts)
# create corpus
corpus = [dict.doc2bow(text) for text in texts]

# tf-idf
tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# latent semantic indexing with 10 topics
lsi = gs.models.LsiModel(corpus_tfidf, id2word=dict, num_topics =10)

for i in lsi.print_topics():
    print i

Here's output

0.400*"system" + 0.318*"survey" + 0.290*"user" + 0.274*"eps" + 0.236*"management" + 0.236*"opinion" + 0.235*"response" + 0.235*"time" + 0.224*"interface" + 0.224*"computer"
0.421*"minors" + 0.420*"graph" + 0.293*"survey" + 0.239*"trees" + 0.226*"paths" + 0.226*"intersection" + -0.204*"system" + -0.196*"eps" + 0.189*"widths" + 0.189*"quasi"
-0.318*"time" + -0.318*"response" + -0.261*"error" + -0.261*"measurement" + -0.261*"perceived" + -0.261*"relation" + 0.248*"eps" + -0.203*"opinion" + 0.195*"human" + 0.190*"testing"
0.416*"random" + 0.416*"binary" + 0.416*"generation" + 0.416*"unordered" + 0.256*"trees" + -0.225*"minors" + -0.177*"survey" + 0.161*"paths" + 0.161*"intersection" + 0.119*"error"
-0.398*"abc" + -0.398*"lab" + -0.398*"machine" + -0.398*"applications" + -0.301*"computer" + 0.242*"system" + 0.237*"eps" + 0.180*"testing" + 0.180*"engineering" + 0.166*"management"

Any suggestions or general comments would be appreciated.

Did you solve your problem? I'm trying the same thing at the moment and i'm unsure too how to get the gensim lsi-model into the sklearn classifiers. — J-H, Jan 08 '16 at 14:35

score 0 · Answer 1 · answered Nov 01 '16 at 19:55

Just started working on the same problem, but with SVM instead, AFAIK after training your model you need to do something like this:

new_text = 'here is some document'
text_bow = dict.doc2bow(new_text)
vector = lsi[text_bow]

Where vector is a topic distribution in your document, with length equal to number of topics you choose for training, 10 in your case. So you need to represent all your documents as topic distributions and than feed them to classification algorithm.

P.S. I know it's kind of an old question, but I keep seeing it in google results every time I searching )

combining LSA/LSI with Naive Bayes for document classification

1 Answers1