
I am using the 'lda' package in R for topic modeling. I want to use a fitted Latent Dirichlet Allocation (LDA) model to predict topics (collections of related words in a document) for a new dataset. While looking into this, I came across the predictive.distribution() function. However, it takes document_sums as an input parameter, and document_sums is itself part of the output of fitting a model, so I don't see how to apply it to documents the model was not fitted on. I need help understanding how to use an existing model on a new dataset to predict topics. Here is the example code from the package documentation, written by Jonathan Chang:

# Fit a model
data(cora.documents)
data(cora.vocab)

K <- 10 ## Num clusters

result <- lda.collapsed.gibbs.sampler(cora.documents, K, cora.vocab, 25, 0.1, 0.1)

# Predict new words for the first two documents
predictions <- predictive.distribution(result$document_sums[, 1:2], result$topics, 0.1, 0.1)

# Use top.topic.words to show the top 5 predictions in each document.
top.topic.words(t(predictions), 5)

Any help will be appreciated

Thanks & Regards,

Ankit

ankit sethi
  • You might want to search SO with this strategy: [r] "reproducible example". (I was not the source of the downvote but my suspicion is that the person who did so felt the question was too vague to admit much of an answer that used coding as an end-point.) – IRTFM May 07 '12 at 14:51
  • @DWin thanks for the advice, but my question is not about having reproducible code. Rather, I wanted to know whether I can use the predictive.distribution() function on a new dataset on which I haven't fitted my model and, if not, whether there is a way to use my existing model on a new dataset. Please excuse the lack of detail in my original post, as I am new to programming in general and to posting questions on forums. – ankit sethi May 07 '12 at 15:28
  • The issue is not "reproducibility" but specificity. Your question has no code and no example data. I doubt that the term "topic" is something that is specific to LDA. You need to provide background and construction of a dataset that matches that specificity. This is a coding site. I'm going to add a downvote which I will remove if there is data and code when I come back. – IRTFM May 07 '12 at 15:33
  • @DWin I hope this makes my question clear. – ankit sethi May 07 '12 at 17:20
  • This question has been asked and answered here: http://stackoverflow.com/a/16120518/1036500 – Ben Apr 21 '13 at 08:09

1 Answer


I don't know how you can achieve this in R, but have a look at the 2009 paper by Wallach et al., 'Evaluation Methods for Topic Models'. Section 4 mentions three methods to calculate P(z|w): one based on importance sampling, and two others called the 'Chib-style estimator' and the 'left-to-right estimator'.

MALLET has an implementation of the left-to-right estimator method.
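To make the general idea concrete (keep the fitted topics fixed and infer only the topic mixture of an unseen document), here is a minimal sketch in base R. It is not an lda package function and not one of the estimators from the paper: fold_in, its arguments, and the toy matrix are hypothetical names made up for illustration, and it uses a simple EM-style update rather than Gibbs sampling. The topics argument is assumed to be a K x V matrix of topic-word counts shaped like result$topics, and the new document is assumed to be given as a length-V vector of word counts over the same vocabulary.

```r
# Hypothetical helper (NOT part of the lda package): estimate the topic
# mixture of ONE new document while the fitted topics stay fixed.
#   topics      : K x V matrix of topic-word counts (shaped like result$topics)
#   word_counts : length-V vector of word counts for the new document
#   alpha, eta  : Dirichlet smoothing values, as used when fitting
fold_in <- function(topics, word_counts, alpha = 0.1, eta = 0.1, n_iter = 50) {
  K <- nrow(topics)
  # Smoothed per-topic word probabilities; each row sums to 1.
  beta <- (topics + eta) / rowSums(topics + eta)
  theta <- rep(1 / K, K)  # start from a uniform topic mixture
  for (i in seq_len(n_iter)) {
    # Weight of topic k for word w: theta[k] * beta[k, w].
    # theta is recycled down each column, so row k is scaled by theta[k].
    resp <- beta * theta
    # Normalise each column so the topic weights for a word sum to 1.
    resp <- sweep(resp, 2, colSums(resp), "/")
    # Expected number of words in the document assigned to each topic.
    doc_sums <- as.vector(resp %*% word_counts)
    # Re-estimate the smoothed topic mixture from those expected counts.
    theta <- (doc_sums + alpha) / sum(doc_sums + alpha)
  }
  list(document_sums = doc_sums,
       theta = theta,
       predictive = as.vector(theta %*% beta))
}

# Toy "fitted" topics over a four-word vocabulary (made-up numbers).
topics <- matrix(c(8, 7, 0, 0,
                   0, 0, 9, 6),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("gene", "dna", "neural", "network")))

# A new, unseen document expressed as word counts over the same vocabulary.
new_doc <- c(gene = 3, dna = 2, neural = 0, network = 1)

fit <- fold_in(topics, new_doc)
fit$theta       # estimated topic mixture for the new document
fit$predictive  # smoothed word probabilities implied by that mixture
```

Here document_sums holds expected (fractional) topic counts rather than the integer counts a sampler would produce, and predictive is the same kind of smoothed topic mixture over words that predictive.distribution appears to report, as far as I understand it. Treat it as an illustration of the fold-in idea, not as a replacement for a proper held-out estimator.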

abhinavkulkarni