
I'm clustering documents using topic modeling and need to come up with the optimal number of topics. So I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics.

I have divided my corpus into ten batches and set aside one batch as a holdout set. I ran latent Dirichlet allocation (LDA) on the other nine batches (180 documents in total) with 10 to 60 topics. Now I have to calculate the perplexity or log-likelihood for the holdout set.

I found the code below in one of CV's discussions, but I really don't understand several of its lines. I have a document-term matrix (dtm) for the holdout set (20 documents), but I don't know how to calculate the perplexity or log-likelihood of this holdout set.


Questions:

  1. Can anybody explain to me what seq(2, 100, by = 1) means here? Also, what does AssociatedPress[21:30] mean? And what is function(k) doing here?

     best.model <- lapply(seq(2, 100, by=1), function(k){ LDA(AssociatedPress[21:30,], k) })
    
  2. If I want to calculate the perplexity or log-likelihood of the holdout set called dtm, is there better code? I know there are perplexity() and logLik() functions, but since I'm new I cannot figure out how to apply them to my holdout matrix, dtm.

  3. How can I do ten-fold cross-validation with my corpus of 200 documents? Is there existing code that I can invoke? I found caret for this purpose, but I cannot figure that out either.

user37874
  • What in the documentation for `?seq` and `?AssociatedPress` and the other functions did you not understand? – probabilityislogic Jan 25 '14 at 22:58
  • I updated the code for this and saved it as a gist; it has a plot method that prints by default. `devtools::source_url("https://gist.githubusercontent.com/trinker/594bd132b180a43945f7/raw/70fbb1aa2a9113837a9a9f8a6c43d884c2ef5bd0/optimal_k%25202")` +1 nice answer. – Tyler Rinker Dec 21 '15 at 05:21

2 Answers


The accepted answer to this question is good as far as it goes, but it doesn't actually address how to estimate perplexity on a validation dataset or how to use cross-validation.

Using perplexity for simple validation

Perplexity is a measure of how well a probability model fits a new set of data. In the topicmodels R package it is simple to compute with the perplexity() function, which takes as arguments a previously fitted topic model and a new set of data, and returns a single number. The lower the better.
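
Concretely (a standard formulation, not spelled out in the original answer): perplexity is the exponentiated negative per-token log-likelihood of the held-out data,

$$\mathrm{perplexity}(w) = \exp\left\{-\frac{\log p(w)}{N}\right\}$$

where $N$ is the total number of tokens in the held-out documents, so a lower value means the model assigns the new data a higher probability.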

For example, splitting the AssociatedPress data into a training set (75% of the rows) and a validation set (25% of the rows):

# load up some R packages including a few we'll need later
library(topicmodels)
library(doParallel)
library(ggplot2)
library(scales)

data("AssociatedPress", package = "topicmodels")

burnin = 1000
iter = 1000
keep = 50

full_data  <- AssociatedPress
n <- nrow(full_data)
#-----------validation--------
k <- 5

splitter <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]

fitted <- LDA(train_set, k = k, method = "Gibbs",
                          control = list(burnin = burnin, iter = iter, keep = keep) )
perplexity(fitted, newdata = train_set) # about 2700
perplexity(fitted, newdata = valid_set) # about 4300

The perplexity is higher for the validation set than the training set, because the topics have been optimised based on the training set.
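
The same pattern answers question 2 from the original post directly. A minimal sketch, assuming dtm is the asker's 20-document hold-out DocumentTermMatrix and fitted is a model trained on the other 180 documents:

# hold-out perplexity for the asker's setup (object names assumed from the question)
perplexity(fitted, newdata = dtm)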

Using perplexity and cross-validation to determine a good number of topics

The extension of this idea to cross-validation is straightforward. Divide the data into different subsets (say 5), and each subset gets one turn as the validation set and four turns as part of the training set. However, it's really computationally intensive, particularly when trying out the larger numbers of topics.

You might be able to use caret to do this, but I suspect it doesn't handle topic modelling yet. In any case, it's the sort of thing I prefer to do myself to be sure I understand what's going on.

The code below, even with parallel processing on 7 logical CPUs, took 3.5 hours to run on my laptop:

#----------------5-fold cross-validation, different numbers of topics----------------
# set up a cluster for parallel processing
cluster <- makeCluster(detectCores(logical = TRUE) - 1) # leave one CPU spare...
registerDoParallel(cluster)

# load up the needed R package on all the parallel sessions
clusterEvalQ(cluster, {
   library(topicmodels)
})

folds <- 5
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300) # candidates for how many topics

# export all the needed R objects to the parallel sessions
clusterExport(cluster, c("full_data", "burnin", "iter", "keep", "splitfolds", "folds", "candidate_k"))

# we parallelise over the different numbers of topics: a processor is allocated
# a value of k and does the cross-validation for it serially. This is because we
# assume there are more candidate values of k than there are cross-validation
# folds, so this split is the more efficient way to parallelise
system.time({
results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar%{
   k <- candidate_k[j]
   results_1k <- matrix(0, nrow = folds, ncol = 2)
   colnames(results_1k) <- c("k", "perplexity")
   for(i in 1:folds){
      train_set <- full_data[splitfolds != i , ]
      valid_set <- full_data[splitfolds == i, ]

      fitted <- LDA(train_set, k = k, method = "Gibbs",
                    control = list(burnin = burnin, iter = iter, keep = keep) )
      results_1k[i,] <- c(k, perplexity(fitted, newdata = valid_set))
   }
   return(results_1k)
}
})
stopCluster(cluster)

results_df <- as.data.frame(results)

ggplot(results_df, aes(x = k, y = perplexity)) +
   geom_point() +
   geom_smooth(se = FALSE) +
   ggtitle("5-fold cross-validation of topic modelling with the 'Associated Press' dataset",
           "(ie five different models fit for each candidate number of topics)") +
   labs(x = "Candidate number of topics", y = "Perplexity when fitting the trained model to the hold-out set")

We see in the results that 200 topics is too many and has some over-fitting, and 50 is too few. Of the numbers of topics tried, 100 is the best, with the lowest average perplexity on the five different hold-out sets.
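
To extract that choice programmatically rather than by eye, here is a small base-R addition (not in the original answer) that averages the five hold-out perplexities for each candidate and picks the minimum:

# average the per-fold perplexities for each candidate number of topics
mean_perp <- aggregate(perplexity ~ k, data = results_df, FUN = mean)
mean_perp$k[which.min(mean_perp$perplexity)]  # 100 for the run described above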

[Figure: hold-out perplexity for each candidate number of topics; the average is lowest at k = 100]

Peter Ellis
  • Did you normalize (i.e. divide) the perplexity by number of topics? – Evan Zamir Jun 09 '17 at 22:30
  • @EvanZamir why would you normalise the perplexity? Any reference of papers doing that? Purely out of curiosity. – francium87d Jun 27 '17 at 10:01
  • Thank you for this piece of code. I've written a variation on it in which I also loop over different values of alpha and delta. How do you know if you need a higher k or, for example, a lower alpha value? I find it hard to find a structured way to do parameter tuning. – Nina van Bruggen Oct 27 '21 at 12:11

I wrote the answer on CV that you refer to, here's a bit more detail:

  1. seq(2, 100, by = 1) simply creates a number sequence from 2 to 100 in steps of one, so 2, 3, 4, 5, ... 100. Those are the numbers of topics that I want to use in the models: one model with 2 topics, another with 3 topics, and so on up to 100 topics. lapply() applies the anonymous function function(k) to each number in that sequence, so you get one fitted model per value of k (see the small illustration after this list).

  2. AssociatedPress[21:30,] is simply a subset (documents 21 to 30) of the built-in data in the topicmodels package. I just used a subset in that example so that it would run faster.
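
To make the lapply()/function(k) pattern concrete, here is a tiny illustration with hypothetical values (not from the original answer):

# seq() builds the candidate values; lapply() calls the anonymous
# function once for each value and collects the results in a list
ks <- seq(2, 5, by = 1)       # 2 3 4 5
lapply(ks, function(k) k^2)   # a list containing 4, 9, 16, 25
# the LDA() call works the same way: it returns a list of fitted
# models, one per candidate number of topics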

Regarding the general question of the optimal number of topics, I now follow the example of Martin Ponweiser on Model Selection by Harmonic Mean (section 4.3.3 of his thesis, which is here: http://epub.wu.ac.at/3558/1/main.pdf). Here's how I do it at the moment:
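
To summarise the mathematics the function below implements: given log-likelihoods $\ell_s$ from the $S$ kept Gibbs samples, the harmonic-mean estimate of the log marginal likelihood is

$$\log \hat{p}(w \mid k) = \tilde{\ell} - \log \frac{1}{S}\sum_{s=1}^{S} e^{-(\ell_s - \tilde{\ell})}$$

where $\tilde{\ell}$ is the median of the $\ell_s$, subtracted and added back purely for numerical stability; the Rmpfr package provides the high-precision arithmetic the exponentials require.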

library(topicmodels)
#
# get some of the example data that's bundled with the package
#
data("AssociatedPress", package = "topicmodels")

harmonicMean <- function(logLikelihoods, precision=2000L) {
  library("Rmpfr")
  llMed <- median(logLikelihoods)
  as.double(llMed - log(mean(exp(-mpfr(logLikelihoods,
                                       prec = precision) + llMed))))
}

The log-likelihood values are then determined by first fitting the model, using for example:

k = 20
burnin = 1000
iter = 1000
keep = 50

fitted <- LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",
              control = list(burnin = burnin, iter = iter, keep = keep))

Here keep indicates that the log-likelihood is evaluated and stored every keep iterations. The fitted model returns all of these values, including the ones from the burn-in period, so these need to be omitted before calculating the harmonic mean (this assumes burnin is a multiple of keep; with burnin = 1000 and keep = 50, the first 1000/50 = 20 stored values are dropped):

logLiks <- fitted@logLiks[-c(1:(burnin/keep))]
harmonicMean(logLiks)

So to do this over a sequence of topic models with different numbers of topics...

# generate numerous topic models with different numbers of topics
sequ <- seq(2, 50, 1) # in this case a sequence of numbers from 2 to 50, by ones.
fitted_many <- lapply(sequ, function(k) LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) ))

# extract the log-likelihoods from each fitted model (dropping burn-in, as above)
logLiks_many <- lapply(fitted_many, function(L)  L@logLiks[-c(1:(burnin/keep))])

# compute harmonic means
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))

# inspect
plot(sequ, hm_many, type = "l")

# compute optimum number of topics
sequ[which.max(hm_many)]
## 6

[Figure: harmonic mean of the log-likelihood for each number of topics] Here's the output, with the number of topics along the x-axis, indicating that 6 topics is optimal.

Cross-validation of topic models is pretty well documented in the docs that come with the package; see here for example: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf. Give that a try and then come back with a more specific question about coding CV with topic models.

Ben
  • Thank you for your answer. In Grün and Hornik's paper, they say that we can do cross-validation but do not mention exactly how. My main question now is about your implementation of perplexity() or logLik(). The Grün paper mentions that "perplexity() can be used to determine the perplexity of a fitted model also for new data". OK, this is what I want to do. – user37874 Feb 06 '14 at 21:20
  • I want to run LDA with 180 docs (the training set) and check perplexity on 20 docs (the hold-out set). Your implementation, with my modification, is: best.model <- lapply(seq(2,100, by=1), function(k){LDA(dtm[1:20,], k)}); best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik))); best.model.logLik.df <- data.frame(topics=c(2:100), LL=as.numeric(as.matrix(best.model.logLik))). I do not see any point where you run LDA on the training set first and then calculate the perplexity on the hold-out set given that fitted model. – user37874 Feb 06 '14 at 21:31
  • Am I misunderstanding something about the perplexity function? – user37874 Feb 06 '14 at 21:33
  • Could you rewrite this code in Python? That would be totally awesome since I think there are more Pythoneers (like me) out there than R users :) – LittleBobbyTables Jul 30 '14 at 19:33
  • Sorry, I'm an R monoglot, can't help with python! – Ben Jul 30 '14 at 21:09
  • What's the role of burnin and keep here? – neel Sep 03 '14 at 06:51
  • For a small document set, the number of topics varies from 15 to 50? How should I interpret that? – neel Sep 03 '14 at 09:19
  • Great answer and explanation @Ben. Just one additional question. Depending on the Dirichlet hyperparameters of the LDA model, alpha (topic proportions) and eta (topic multinomials), the optimal number of topics will change. For example, higher values of eta can be expected to increase the number of topics (Griffiths and Steyvers 2004). What are the values of these two hyperparameters assumed here? Thanks – Economist_Ayahuasca Oct 19 '16 at 10:19
  • This answer is good for what it does, but doesn't actually address cross-validation, or using perplexity on a new validation set. – Peter Ellis Jan 04 '17 at 21:48
  • Very useful answer with a great explanation. Only one doubt: why are the values of the log-likelihood negative? If I understood it correctly, the log-likelihood is log P(w|T), where T are the topics. Should it not then be positive, since we are dealing with probabilities? – Economist_Ayahuasca Jan 23 '17 at 09:04
  • Lately I have been using this pkg to find the optimum number of topics: https://cran.r-project.org/web/packages/ldatuning – Ben Feb 04 '18 at 18:21