
I am using the following code to run LDA and get the topics and the words associated with each topic.

keythemes <- function(x, stp = NULL){
        suppressPackageStartupMessages(library(lda))
        suppressPackageStartupMessages(library(tm))
        suppressPackageStartupMessages(library(stringr))
        x <- iconv(x$CONTENT, "WINDOWS-1252", "UTF-8") # use the function argument, not the global `a`
        myCorpus <- Corpus(VectorSource(x))
        myCorpus <- tm_map(myCorpus, content_transformer(tolower), mc.cores = 1)
        myCorpus <- tm_map(myCorpus, removePunctuation, mc.cores = 1)
        myCorpus <- tm_map(myCorpus, removeNumbers, mc.cores = 1)
        myStopwords <- c(stopwords("english"), stp)
        myCorpus <- tm_map(myCorpus, removeWords, myStopwords, mc.cores = 1)
        s <- tm_map(myCorpus, stemDocument, mc.cores = 1)
        s <- TermDocumentMatrix(s, control = list(wordLengths = c(3, Inf))) # build the TDM from the stemmed corpus; `wordLengths` is the correct control option
        a.tdm.sp <- removeSparseTerms(s, sparse = 0.99)
        suppressPackageStartupMessages(require(slam))
        a.tdm.sp.t <- t(a.tdm.sp) # transpose into a document-term matrix
        term_tfidf <- tapply(a.tdm.sp.t$v/row_sums(a.tdm.sp.t)[a.tdm.sp.t$i], a.tdm.sp.t$j, mean) * log2(nDocs(a.tdm.sp.t)/col_sums(a.tdm.sp.t > 0)) # calculate tf-idf values
        a.tdm.sp.t.tdif <- a.tdm.sp.t[, term_tfidf >= 1.0] # keep only informative terms
        a.tdm.sp.t.tdif <- a.tdm.sp.t.tdif[row_sums(a.tdm.sp.t.tdif) > 0, ] # drop documents left empty by the filter (subset the filtered matrix, not the original)
        suppressPackageStartupMessages(require(topicmodels))
        best.model <- lapply(seq(2, 3, by = 1), function(d){LDA(a.tdm.sp.t.tdif, d)}) # fit models with 2 and 3 topics
        best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
        best.model.logLik.df <- data.frame(topics = 2:3, LL = as.numeric(as.matrix(best.model.logLik)))
        best.model.logLik.df.sort <- best.model.logLik.df[order(-best.model.logLik.df$LL), ]
        ntop <- best.model.logLik.df.sort[1, ]$topics # number of topics with the highest log-likelihood
        set.seed(375)
        layout(matrix(c(1, 2), nrow = 2), heights = c(1, 6))
        par(mar = rep(0, 4))
        plot.new()
        text(x = 0.5, y = 0.5, "Key themes based on the key words chosen. \n Themes are populated using Latent Dirichlet Allocation.", cex = 1.2)
        lda <- LDA(a.tdm.sp.t.tdif, ntop) # fit an LDA model with the optimum number of topics
        a <- get_terms(lda, 5) # get keywords for each topic, just for a quick look
        a <- data.frame(a)
        suppressPackageStartupMessages(library(gridExtra))
        grid.table(a)
}
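
For reference, the function expects a data frame with a CONTENT column. A minimal call, assuming the reproducible data shown further below has been assigned to `a` (the extra stop words are purely illustrative):

    keythemes(a, stp = c("amazon", "flipkart"))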

How can I get the probability values for each word within a topic, and for each topic as well? My desired output is as follows:

Topic 1    Prob.Values   Topic 2    Prob.Values
offer         0.72       women         0.24
amazon        0.01       shoes         0.06
footwear      0.04       size          0.02
flat          0.07       million       0.22

Right now I am getting only the topics and the words. I have explored the gamma and beta values: lda@gamma provides the proportional distribution of each document across the topics, while lda@beta provides a score for every word in each topic.
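
For context, a quick way to inspect those two slots (a sketch, with `lda` being the fitted model from the function above):

    dim(lda@gamma)  # documents x topics; each row sums to 1
    dim(lda@beta)   # topics x terms, on a log scale, hence the negative values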

I am not sure whether the beta scores are actual probability scores or log-likelihood scores, because the values go beyond 100 and many are negative. A reproducible example of the data is as follows:

structure(list(article_id = c(4.43047e+11, 4.45992e+11, 4.45928e+11, 
4.45692e+11, 4.4574e+11, 4.43754e+11), CONTENT = c("http://www.koovs.com/women/dresses/brand-koovs/sortby-price-low/ Coupon: DRESS50 Validi tii: 17th November Not valid on discounted products.", 
"Jabong has a lot to offer this winter season. So are you ready to click and pick on the all new winter store where all the products you choose are under the budget price of Rs 999 with massive discount of", 
"daughters (Sophia, Sistine and Scarlet) all wore beautiful dresses. 'GMA' Hot List: Jeff Bezos, Sylvester Stallone and a Puppy Party. More. Amazon's Jeff Bezos weights in on making space history and more in today's 60-second hot list. 1:10 | 11/24/15. Share. Title. Description. Share From. Share With. Facebook...", 
"Bags,Wallets and Belts -- AT, wildcrafts & more starting 134 only only on app Main link äóñ http://dl.flipkart.com/dl/bags-wallets-belts/pr... 134 only http://www.flipkart.com/grabbit-men-black-walle...", 
"not revert to a Techcircle.in query till the time of filing this report. Rajan has been the mobile business head of Flipkart-controlled lifestyle e-tailer Myntra since June last year. An alumnus of Delhi College of Engineering and IIM Ahmedabad, Rajan is also the co-founder of Easy2commute.com, a carpooling...", 
NA)), .Names = c("article_id", "CONTENT"), row.names = c(1299L, 
1710L, 1822L, 2371L, 2456L, 2896L), class = "data.frame")
LeArNr
  • Seems like you're using some code I wrote a while ago. You might be interested to know that I now use a different method for identifying the best model; the details are here: http://stackoverflow.com/a/21394092/1036500 – Ben Mar 09 '16 at 06:24
  • @Ben Many thanks Ben... will check it out. On my question, I figured out that I should be using terms along with the beta scores. Is it safe to assume the beta scores indicate the importance of a term in a topic? You are right! This code was taken from the chapter you wrote on Data Applications. Good to know the author himself has spoken to me!!! – LeArNr Mar 09 '16 at 06:35

1 Answer


@beta is the logarithmized word distribution for each topic (natural logarithms, which is why the values are negative), so you can convert it back to a simple probability distribution:

    Terms.Probability <- exp(t(lda@beta))

Now Terms.Probability contains values between 0 and 1: after the transpose, each column holds the term distribution for one topic and sums to 1.
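
To get the exact layout in the question (each topic's top terms next to their probabilities), a sketch along these lines should work; `lda` is the fitted model from the question, and 5 terms per topic is an illustrative choice:

    library(topicmodels)

    probs <- posterior(lda)$terms  # topics x terms; each row is a probability distribution summing to 1
    top.terms <- get_terms(lda, 5) # top 5 terms per topic, as in the question

    # pair each topic's top terms with their probabilities, one two-column block per topic
    out <- do.call(cbind, lapply(seq_len(nrow(probs)), function(k) {
      data.frame(Term = top.terms[, k],
                 Prob.Values = round(unname(probs[k, top.terms[, k]]), 2),
                 stringsAsFactors = FALSE)
    }))
    names(out) <- paste(rep(colnames(top.terms), each = 2), c("Term", "Prob.Values"))
    out

Note that posterior(lda)$terms is simply exp(lda@beta) with dimnames attached, so this agrees with the conversion above.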

M.Rabiei