
I am trying to use a mallet topic model with the LDAvis package. To do so you must extract a number of parameters from the topic.model object: phi, theta, vocab, doc.length, and term.frequency.

The mallet documentation makes no mention of these parameters. How can I extract them from a topic.model object generated from data using mallet.import() and MalletLDA()?

So far, I've used mallet to fit the topic model:

```r
# Toy corpus: three short documents with integer IDs
id_numbers <- as.integer(c(1, 2, 3))
comments <- c("words to be used for text mining",
              "that may or may not be interesting",
              "but could serve as a good example")
df <- data.frame(id_numbers, comments, stringsAsFactors = FALSE)

# Set up topic model
library(mallet)

stoplist <- c("to", "be", "or")
write.csv(stoplist, file = "example_stoplist.csv")

mallet.instances <- mallet.import(
  as.character(df$id_numbers),
  as.character(df$comments),
  "example_stoplist.csv",
  FALSE,
  token.regexp = "[\\p{L}']+")

topic.model <- MalletLDA(num.topics = 10)
topic.model$loadDocuments(mallet.instances)
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
topic.model$setAlphaOptimization(40, 80) # tweak optimization interval and burn-in iterations
topic.model$train(400)

# Topic-word matrix (smoothed and normalized)
topic.words.m <- mallet.topic.words(topic.model, smoothed = TRUE,
                                    normalized = TRUE)
dim(topic.words.m)

vocabulary <- topic.model$getVocabulary()
colnames(topic.words.m) <- vocabulary

# Document-topic matrix (smoothed and normalized)
doc.topics.m <- mallet.doc.topics(topic.model, smoothed = TRUE,
                                  normalized = TRUE)

doc.topics.df <- as.data.frame(doc.topics.m)
doc.topics.df <- cbind(id_numbers, doc.topics.df)

doc.topic.means.df <- aggregate(doc.topics.df[, 2:ncol(doc.topics.df)],
                                list(doc.topics.df[, 1]),
                                mean)
```

From this, I now need to generate the JSON for LDAvis. I tried the following:

```r
# LDAvis
library(LDAvis)
phi <- t(mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE))
phi.count <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = FALSE)

topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
topic.counts <- rowSums(topic.words)

topic.proportions <- topic.counts / sum(topic.counts)

vocab <- topic.model$getVocabulary()

doc.tokens <- data.frame(id = c(1:nrow(doc.topics.m)), tokens = 0)
for (i in vocab) {
  # Find word if word in text
  matched <- grepl(i, df$comments)
  doc.tokens[matched, 2] <- doc.tokens[matched, 2] + 1
}

createJSON(phi = phi,
           theta = doc.topics.m,
           doc.length = doc.tokens,
           vocab = vocab,
           term.frequency = apply(phi.count, 1, sum))
```

However, this gives me the following error message:

```
Error in createJSON(phi = phi, theta = doc.topics.m, doc.length = doc.tokens,  : 
  Number of rows of phi does not match 
      number of columns of theta; both should be equal to the number of topics 
      in the model.
```

So I seem to be generating the phi and theta matrices in the wrong way.

histelheim
  • Please attempt to provide some sort of [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data so we can see what you are trying to do and can test possible solutions. – MrFlick Dec 08 '16 at 16:28
  • What does `str` on your mallet object produce? – emilliman5 Dec 08 '16 at 16:32
  • @emilliman5: `str(topic.model)` gives `Formal class 'jobjRef' [package "rJava"] with 2 slots ..@ jobj : ..@ jclass: chr "cc/mallet/topics/RTopicModel"` – histelheim Dec 08 '16 at 16:44
  • I don't think you need to transform `mallet.topic.words` to generate `phi`. Look at the dimensions of `phi`, `theta`, and `doc.length` to get them squared away. – emilliman5 Dec 08 '16 at 16:54
  • If you do not mind using the `lda` package, you have data in the format which is compatible with the LDAvis package. – jazzurro Dec 08 '16 at 16:58

1 Answer


Try removing the matrix transpose function t() from the line where you create phi.

RMallet already returns these matrices in the orientation LDAvis expects: `mallet.doc.topics()` gives theta with one row per document and one column per topic, and `mallet.topic.words()` gives phi with one row per topic and one column per vocabulary term. Sometimes it makes sense to flip one of them so that topics are always rows or always columns, but not here.
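As a rough illustration of the shapes involved, here is a base-R sketch that does not call mallet or LDAvis at all. It simulates token-level topic assignments (the sizes `K`, `W`, `D`, `N` and the smoothing constant `beta` are made-up assumptions), builds count matrices in the same orientation as `mallet.topic.words()` and `mallet.doc.topics()`, and checks the dimension rule from the error message. The `doc.length` and `term.frequency` lines go beyond what is stated above: they assume `createJSON()` wants a plain vector with one entry per document and one entry per vocabulary term, respectively.

```r
# Simulated stand-in for a fitted model; none of this is real mallet output.
set.seed(1)
K <- 10; W <- 15; D <- 3; N <- 200       # topics, terms, documents, tokens (hypothetical)
beta <- 0.01                             # hypothetical smoothing constant

# Each token belongs to one document, is one vocabulary term, and has one topic.
doc   <- factor(sample(D, N, replace = TRUE), levels = seq_len(D))
term  <- factor(sample(W, N, replace = TRUE), levels = seq_len(W))
topic <- factor(sample(K, N, replace = TRUE), levels = seq_len(K))

# Count matrices in the orientation described above:
# topics x terms (K x W) and documents x topics (D x K).
topic.word.counts <- unclass(table(topic, term))
doc.topic.counts  <- unclass(table(doc, topic))

# Smooth, then normalize each row to sum to 1. No transpose is applied.
phi   <- (topic.word.counts + beta) / rowSums(topic.word.counts + beta)
theta <- (doc.topic.counts + beta) / rowSums(doc.topic.counts + beta)

vocab          <- paste0("term", seq_len(W))              # placeholder vocabulary
doc.length     <- as.numeric(rowSums(doc.topic.counts))   # tokens per document, length D
term.frequency <- as.numeric(colSums(topic.word.counts))  # corpus count per term, length W

# The rule from the error message: rows of phi == columns of theta == K.
stopifnot(nrow(phi) == ncol(theta), ncol(theta) == K)
stopifnot(ncol(phi) == length(vocab), length(term.frequency) == length(vocab))
stopifnot(nrow(theta) == length(doc.length))

# Transposing phi breaks the rule, which is what triggered the error.
nrow(t(phi)) == ncol(theta)
```

With a real model, the same checks can be run on the matrices returned by `mallet.topic.words()` and `mallet.doc.topics()` before calling `createJSON()`. Note that in this sketch `term.frequency` uses column sums (one value per term); row sums of a topics x terms matrix would give one value per topic instead.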

David Mimno