
I am working on an LDA model with textmineR. I have calculated coherence and log-likelihood measures and optimized my model.

As a last step, I would like to see how well the model predicts topics on unseen data. I am therefore using the predict() function from the textmineR package with Gibbs sampling on my test-set sample. This results in predicted "theta" values (per-document topic probabilities) for each document in my test-set sample.
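Roughly, my prediction step looks like this (a minimal sketch; `m`, `dtm_test`, and the iteration counts stand in for my actual objects and settings):

```r
library(textmineR)

# m is the LDA model fitted with FitLdaModel() on the training DTM;
# dtm_test is a document-term matrix built from the unseen documents.
theta_test <- predict(m,
                      newdata    = dtm_test,
                      method     = "gibbs",
                      iterations = 200,
                      burnin     = 180)

head(theta_test)  # one row of topic probabilities per unseen document
```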

While I have read in another post that perplexity calculations are not available in the textmineR package (see this post: How do I measure perplexity scores on a LDA model made with the textmineR package in R?), I am now wondering what the purpose of the prediction function is. Especially with a large dataset of over 100,000 documents, it is hard to just visually assess whether the prediction has performed well or not.

I do not want to use perplexity for model selection (I am using coherence/log-likelihood instead), but as far as I understand, perplexity would help me to understand how good the prediction is and how "surprised" the model is by new, previously unseen data.

Since this does not seem to be available for textmineR, I am not sure how to assess the model's predictions. Is there anything else that I could use to measure the prediction quality of my textmineR model?

Thank you!

lole_emily
  • Program it yourself? It is not so hard, I believe. – Karsten W. Jun 24 '20 at 11:26
  • Hello @KarstenW., unfortunately, I do not know what the best way is to assess the prediction quality of the textmineR model, nor how to calculate it with my output. Hence my question above. – lole_emily Jun 27 '20 at 13:19
  • 1
    What is wrong with the solution given in the link you posted, i.e. `textvec::perplexity`? It seems that the `d ` variable in the answer is the data the model was trained on, but there should be no problem to enter unseen data instead. – Karsten W. Jun 27 '20 at 13:43
  • @KarstenW. I guess nothing is wrong with that solution. But since it is not naturally implemented in textmineR and since "Tommy Jones" said that there does not seem to be any "value of perplexity that you couldn't get with likelihood and coherence", I am wondering what would be an alternative instead of calculating perplexity to assess prediction quality. So: how could it be done/how was it meant to be assessed within the textmineR package? If I understand correctly, the statement above indicates that there might be a better way than perplexity to assess the prediction, but I have no clue how. – lole_emily Jun 27 '20 at 15:13
  • 1
    I think you could calculate the loglikelihood on the new data and that would be an equally valid measure of the quality of the model. – Karsten W. Jun 27 '20 at 19:21
  • I'd love to answer this, but I'm not sure what the question is. textmineR has three functions to calculate evaluation metrics for topic models, all three of which can be used on held-out data or in-sample data. `CalcLikelihood` calculates the likelihood. `CalcProbCoherence` calculates coherence of topics WRT a set of documents. `CalcTopicModelR2` gets an R-squared. – Tommy Jones Jul 13 '20 at 14:03
  • And, FWIW, perplexity is just a function of likelihood. So in addition to using `text2vec::perplexity`, you can calculate the held-out likelihood with `CalcLikelihood` and exponentiate it for model `m`, as in the sketch after these comments. IMO perplexity doesn't tell you anything that likelihood doesn't. In spite of what the internet says about it being for "assessing predictions", it's not made for that any more or less than other metrics. – Tommy Jones Jul 13 '20 at 14:08
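Tommy Jones' inline snippet, expanded into a runnable block (a sketch using the same placeholder names: `m` is the fitted model, `dtm_test` the held-out document-term matrix, and `theta_test` the output of the `predict()` call above):

```r
library(textmineR)

# log-likelihood of the held-out documents under the fitted model
likelihood_test <- CalcLikelihood(dtm   = dtm_test,
                                  phi   = m$phi,
                                  theta = theta_test,
                                  cpus  = 2)

# perplexity is just a transformation of that likelihood:
# exponentiate the negative log-likelihood per held-out token
perplexity_test <- exp(-1 * likelihood_test / sum(dtm_test))
```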

0 Answers