
I am posting my question here because there are already some answers on how to use scikit-learn methods with gensim, such as using scikit vectorizers with gensim or this one, but I haven't seen the whole pipeline used for text classification. I will try to explain my situation a little bit.

I want to use gensim's LDA implementation in order to proceed to text classification. I have one dataset which consists of three parts: train (25K), test (25K) and unlabeled data (50K). What I am trying to do is to learn the latent topic space using the unlabeled data and then transform the train and test sets into this learned latent topic space. I am currently using the scikit-learn implementation to extract the BoW representation. Then I transform it into the inputs required by the LDA implementation, and at the end I transform the train and test sets into the extracted latent topic space. Finally, I go back to csr matrices in order to fit a classifier and obtain the accuracy. Although everything seems fine to me, the performance of the classifier is almost 0%. I am attaching part of the code in order to get some additional help, or in case there is something obvious that I am currently missing.

#imports needed for the snippet below (numpy, scikit-learn, gensim)
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from gensim import models, matutils

#bow representations for the three sets: unlabelled, train and test
vectorizer = CountVectorizer(max_features=3000, stop_words='english')

#note: despite the _tfidf_ names, these are raw term counts from CountVectorizer
corpus_tfidf_unsuper = vectorizer.fit_transform(train_data_unsupervised[:,2])
corpus_tfidf_train = vectorizer.transform(train_ds[:,2])
corpus_tfidf_test = vectorizer.transform(test_ds[:,2])

#transform to gensim acceptable objects
vocab = vectorizer.get_feature_names()
id2word_unsuper=dict([(i, s) for i, s in enumerate(vocab)])
#Sparse2Corpus expects documents as columns by default, hence the transpose
corpus_vect_gensim_unsuper = matutils.Sparse2Corpus(corpus_tfidf_unsuper.T)
corpus_vect_gensim_train = matutils.Sparse2Corpus(corpus_tfidf_train.T)
corpus_vect_gensim_test = matutils.Sparse2Corpus(corpus_tfidf_test.T)

#fit the model to the unlabelled data
lda = models.LdaModel(corpus_vect_gensim_unsuper, 
                  id2word = id2word_unsuper, 
                  num_topics = 10,
                  passes=1)
#transform the train and test set to the latent topic space
docTopicProbMat_train = lda[corpus_vect_gensim_train]
docTopicProbMat_test = lda[corpus_vect_gensim_test]
#transform to csr matrices
train_lda=matutils.corpus2csc(docTopicProbMat_train)
test_lda=matutils.corpus2csc(docTopicProbMat_test)
#fit the classifier and print the accuracy
clf = LogisticRegression()
clf.fit(train_lda.transpose(), np.array(train_ds[:,0]).astype(int))
ypred = clf.predict(test_lda.transpose())
print(accuracy_score(test_ds[:,0].astype(int), ypred))
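
For reference, here is the kind of sanity check I can run on the topic-space corpus and the csr conversion. Passing num_terms=10 to corpus2csc is my assumption, since the transformed "terms" here are the 10 topics rather than vocabulary words:

#sanity check: inspect a few topic vectors and the shape of the csr conversion
for i, doc_topics in enumerate(docTopicProbMat_train):
    print(doc_topics)                  # list of (topic_id, probability) pairs
    if i >= 2:
        break
#num_terms=10 assumed to match num_topics, so no topic rows are dropped
train_lda_check = matutils.corpus2csc(docTopicProbMat_train, num_terms=10)
print(train_lda_check.shape)           # expected (10, 25000): topics x documents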

This is my first post, so if you have any remarks, please do not hesitate to let me know.

  • Highlight your question please! – eliasah Jul 31 '15 at 09:37
  • I changed the format of my question to bold. Is this what you proposed, or is there something that I didn't understand? I can't see any other way to highlight my question. – bekou Jul 31 '15 at 09:46
  • Have you tried to perform a k-fold cross-validation to test your model? Or re-sizing your training/testing sets, like 70/30 instead of 50/50? – eliasah Jul 31 '15 at 11:56
  • I see what you mean. No, I haven't tried cross-validation, but I have tried several LDA implementations and the accuracy (on the same dataset) is even better than the one obtained from raw features (term-document matrix). My problem is specifically with the gensim implementation; I tried gensim because it seems faster than the aforementioned LDA implementations. Btw, I will perform a cross-validation, but I think the problem is something wrong in the gensim object mappings and the back-and-forth transformation to csr. – bekou Jul 31 '15 at 12:10
  • I'm not sure how that is implemented in Python, but I don't think it's the gensim library. It seems like you are using scikit to do all the work for you. Possible issues are the input format, the boundaries of your training/testing sets, and it seems to me that you are performing **just one** pass to train your data, which I'm sure is not enough. You need to perform a grid search over your model parameters to find what best suits your model. – eliasah Jul 31 '15 at 12:16
  • When I say the gensim library, I mean that there is probably something in the documentation that I am missing. Concerning the passes, I have already increased the number but the result is the same: almost 0% accuracy. Btw, thanks for your feedback. – bekou Jul 31 '15 at 12:25
  • I'm sorry I can't help much with the code here (not a fan of Python). – eliasah Jul 31 '15 at 12:27
  • Nope, I'm using Python just because of the simplicity of trying various things quickly. – bekou Jul 31 '15 at 12:29
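
Following up on the cross-validation suggestion from the comments, a minimal sketch of how it could look with scikit-learn. The cv value is a placeholder, it reuses train_lda and train_ds from the question, and in older scikit-learn versions cross_val_score lives in sklearn.cross_validation rather than sklearn.model_selection:

#rough sketch of the k-fold cross-validation suggested in the comments
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(),
                         train_lda.transpose(),
                         np.array(train_ds[:,0]).astype(int),
                         cv=5, scoring='accuracy')
print(scores.mean(), scores.std())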
