
I am attempting to assign a list of open-ended survey responses to 30 different categories using the `LDA()` function in the topicmodels package.

The code I have so far is:

library(tm)
library(topicmodels)

# Build a corpus from the open-ended responses
source <- VectorSource(openended$q2)
corpus <- Corpus(source)

# Standard preprocessing: lowercase, strip numbers/punctuation/whitespace,
# remove stopwords, and stem
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = "english")

mat <- DocumentTermMatrix(corpus)

# Drop documents left empty after preprocessing (LDA() errors on all-zero rows)
rowTotals <- apply(mat, 1, sum)
mat <- mat[rowTotals > 0, ]

burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE

k <- 30

ldaOut <- LDA(mat, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics, file = paste("LDAGibbs", k, "DocsToTopics.csv"))

I already have 10% of the responses in openended$q2 appropriately coded with the correct category. How can I use that labeled subset to train the algorithm?

Thanks!

DBH
  • Since `LDA()` as implemented in `topicmodels` is an unsupervised generative algorithm, you cannot train the model in the way you intend to. You can "only" fit a model and check how well new data fit into this model via `perplexity()`. For a general discussion, you might have a look at this thread: [supervised LDA](https://stackoverflow.com/questions/36902758/r-supervised-latent-dirichlet-allocation-package). Further, the package `lda` offers an `slda` model, but I am not familiar with it and I think it is still not what you want. (Please correct me if I am wrong.) – Manuel Bickel Nov 14 '17 at 12:29
  • Check out the paper called *Labeled LDA* by Ramage. I don't think you can implement it in `R` using `topicmodels`. I am actually working on something similar for Python right now; you can check it out here: https://github.com/KenHBS/LDA_thesis. It is not yet very friendly for other people to work with, but perhaps it can help point you in the right direction. Essentially, you manipulate the prior for theta so that all values are zero, except the element(s) that correspond(s) with the topic in your labeled data. – KenHBS Nov 20 '17 at 17:00
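
The `perplexity()` check mentioned in the comments can be sketched as follows: fit the model on one subset of documents and score the held-out remainder. This is a minimal illustration, not the asker's data; it uses the `AssociatedPress` document-term matrix shipped with topicmodels as a stand-in for `mat`, and the 75/25 split, `k = 5`, and the sampler settings are illustrative choices.

```r
library(topicmodels)

# Stand-in for the asker's DocumentTermMatrix `mat`
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:100, ]

# Hold out 25% of the documents for evaluation
set.seed(2003)
train_idx <- sample(nrow(dtm), size = floor(0.75 * nrow(dtm)))

fit <- LDA(dtm[train_idx, ], k = 5, method = "Gibbs",
           control = list(burnin = 200, iter = 500, seed = 2003))

# Lower perplexity means the held-out documents are better explained
# by the fitted topic model
p <- perplexity(fit, newdata = dtm[-train_idx, ])
p
```

This only evaluates an unsupervised fit; it does not use the 10% of hand-coded labels, which is exactly the limitation the comments point out.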

0 Answers