I am trying to use a mallet topic model with the LDAvis package. To do so, you must extract a number of parameters from the topic.model object: phi, theta, vocab, doc.length, and term.frequency.
The mallet documentation makes no mention of these parameters. How can I extract them from a topic.model object generated from data using mallet.import() and MalletLDA()?
So far, I've used mallet to fit the topic model:
id_numbers <- as.integer(c(1, 2, 3))
comments <- c("words to be used for text mining", "that may or may not be interesting", "but could serve as a good example")
df <- data.frame(id_numbers, comments, stringsAsFactors = FALSE)
# Set up topic model
library(mallet)
stoplist <- c("to", "be", "or")
# Write one stop word per line, which is the format mallet.import() reads
writeLines(stoplist, "example_stoplist.csv")
mallet.instances <- mallet.import(
  as.character(df$id_numbers),
  as.character(df$comments),
  "example_stoplist.csv",
  FALSE,
  token.regexp = "[\\p{L}']+")
topic.model <- MalletLDA(num.topics=10)
topic.model$loadDocuments(mallet.instances)
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
topic.model$setAlphaOptimization(40, 80) # tweak the optimization interval and burn-in iterations
topic.model$train(400)
topic.words.m <- mallet.topic.words(topic.model, smoothed = TRUE,
                                    normalized = TRUE)
dim(topic.words.m)
vocabulary <- topic.model$getVocabulary()
colnames(topic.words.m) <- vocabulary
doc.topics.m <- mallet.doc.topics(topic.model, smoothed = TRUE,
                                  normalized = TRUE)
doc.topics.df <- as.data.frame(doc.topics.m)
doc.topics.df <- cbind(id_numbers, doc.topics.df)
doc.topic.means.df <- aggregate(doc.topics.df[, 2:ncol(doc.topics.df)],
                                list(doc.topics.df[, 1]),
                                mean)
Out of this I now need to generate the JSON for LDAvis. I tried the following:
# LDAvis
library(LDAvis)
phi <- t(mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE))
phi.count <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = FALSE)
topic.words <- mallet.topic.words(topic.model, smoothed=TRUE, normalized=TRUE)
topic.counts <- rowSums(topic.words)
topic.proportions <- topic.counts/sum(topic.counts)
vocab <- topic.model$getVocabulary()
doc.tokens <- data.frame(id=c(1:nrow(doc.topics.m)), tokens=0)
for (i in vocab) {
  # Add one to a document's count if the word occurs in its text
  matched <- grepl(i, df$comments)
  doc.tokens[matched, 2] <- doc.tokens[matched, 2] + 1
}
createJSON(phi = phi,
           theta = doc.topics.m,
           doc.length = doc.tokens,
           vocab = vocab,
           term.frequency = apply(phi.count, 1, sum))
However, this gives me the following error message:
Error in createJSON(phi = phi, theta = doc.topics.m, doc.length = doc.tokens, :
Number of rows of phi does not match
number of columns of theta; both should be equal to the number of topics
in the model.
So I seem to be generating the phi and theta matrices in the wrong way.
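To make the shape requirement in that message concrete, here is a minimal base-R sketch with made-up matrices rather than real model output. It assumes that mallet.topic.words() returns one row per topic and one column per term, and that mallet.doc.topics() returns one row per document and one column per topic. The sizes K, W and D and the helper shapes.agree() are hypothetical names used only for illustration; they are not part of mallet or LDAvis.

```r
# Hypothetical sizes: topics, vocabulary terms, documents
K <- 10
W <- 12
D <- 3

set.seed(1)

# Stand-in shaped like mallet.topic.words() output (assumed K x W)
topic.words <- matrix(runif(K * W), nrow = K, ncol = W)
topic.words <- topic.words / rowSums(topic.words)  # each topic row sums to 1

# Stand-in shaped like mallet.doc.topics() output (assumed D x K)
doc.topics <- matrix(runif(D * K), nrow = D, ncol = K)
doc.topics <- doc.topics / rowSums(doc.topics)     # each document row sums to 1

phi.transposed <- t(topic.words)  # W x K, as in the attempt above
phi.as.is <- topic.words          # K x W, no transpose

# Illustrative helper mirroring the condition named in the error message:
# rows of phi must equal columns of theta (both the number of topics)
shapes.agree <- function(phi, theta) nrow(phi) == ncol(theta)

shapes.agree(phi.transposed, doc.topics)  # FALSE: 12 rows vs 10 columns
shapes.agree(phi.as.is, doc.topics)       # TRUE: 10 rows vs 10 columns
```

If those assumptions hold for the real model, the t() call on phi would be what triggers the message. The other arguments may also need a second look: doc.length is passed as a data frame and term.frequency is summed over topic rows, whereas one value per document and one value per vocabulary term would seem to be what is wanted.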