I am using the following code to run LDA and get the topics and words associated with the topics.
keythemes <- function(x, stp = NULL){
suppressPackageStartupMessages(library(lda))
suppressPackageStartupMessages(library(tm))
suppressPackageStartupMessages(library(stringr))
x <- iconv(x$CONTENT, "WINDOWS-1252", "UTF-8") # use the function argument rather than the global data frame `a`
myCorpus <- Corpus(VectorSource(x))
myCorpus <- tm_map(myCorpus, content_transformer(tolower), mc.cores = 1)
myCorpus <- tm_map(myCorpus, removePunctuation, mc.cores = 1)
myCorpus <- tm_map(myCorpus, removeNumbers, mc.cores = 1)
myStopwords <- c(stopwords("english"), stp)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords, mc.cores = 1)
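myCorpus <- tm_map(myCorpus, stemDocument, mc.cores = 1) # keep the stemmed corpus so the matrix below is built from it
s <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf))) # keep terms of at least 3 characters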
a.tdm.sp <- removeSparseTerms(s, sparse = 0.99)
suppressPackageStartupMessages(require(slam))
a.tdm.sp.t <- t(a.tdm.sp)
term_tfidf <- tapply(a.tdm.sp.t$v/row_sums(a.tdm.sp.t)[a.tdm.sp.t$i], a.tdm.sp.t$j,mean) * log2(nDocs(a.tdm.sp.t)/col_sums(a.tdm.sp.t>0)) # calculate tf-idf values
a.tdm.sp.t.tdif <- a.tdm.sp.t[,term_tfidf>=1.0]
a.tdm.sp.t.tdif <- a.tdm.sp.t.tdif[row_sums(a.tdm.sp.t.tdif) > 0, ] # drop empty documents without discarding the tf-idf filter above
suppressPackageStartupMessages(require(topicmodels))
best.model <- lapply(seq(2, 3, by = 1), function(d){LDA(a.tdm.sp.t.tdif, d)})
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
best.model.logLik.df <- data.frame(topics=c(2:3), LL = as.numeric(as.matrix(best.model.logLik)))
best.model.logLik.df.sort <- best.model.logLik.df[order(-best.model.logLik.df$LL), ]
ntop <- best.model.logLik.df.sort[1,]$topics
set.seed(375)
layout(matrix(c(1, 2), nrow=2), heights=c(1, 6))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Key themes based on the key words chosen. \n Themes are populated using Latent Dirichlet Allocation.", cex = 1.2)
lda <- LDA(a.tdm.sp.t.tdif, ntop) # generate a LDA model the optimum number of topics
a <- get_terms(lda, 5) # get keywords for each topic, just for a quick look
a <- data.frame(a)
suppressPackageStartupMessages(library(gridExtra))
grid.table(a)
}
How can I get the probability value of each word within a topic, and of each topic as well? My desired output is as follows:
Topic 1     Prob.Values    Topic 2    Prob.Values
offer       0.72           women      0.24
amazon      0.01           shoes      0.06
footwear    0.04           size       0.02
flat        0.07           million    0.22
Right now I am getting only the topics and the words. I have tried exploring the gamma and beta values: lda@gamma gives the proportionate distribution of each document across the topics, while lda@beta gives a score for every word in each topic.
I am not sure whether the beta scores are actual probabilities or log-likelihood scores, because the values go beyond 100 and many of them are negative. A reproducible example of the data is as follows:
structure(list(article_id = c(4.43047e+11, 4.45992e+11, 4.45928e+11,
4.45692e+11, 4.4574e+11, 4.43754e+11), CONTENT = c("http://www.koovs.com/women/dresses/brand-koovs/sortby-price-low/ Coupon: DRESS50 Validi tii: 17th November Not valid on discounted products.",
"Jabong has a lot to offer this winter season. So are you ready to click and pick on the all new winter store where all the products you choose are under the budget price of Rs 999 with massive discount of",
"daughters (Sophia, Sistine and Scarlet) all wore beautiful dresses. 'GMA' Hot List: Jeff Bezos, Sylvester Stallone and a Puppy Party. More. Amazon's Jeff Bezos weights in on making space history and more in today's 60-second hot list. 1:10 | 11/24/15. Share. Title. Description. Share From. Share With. Facebook...",
"Bags,Wallets and Belts -- AT, wildcrafts & more starting 134 only only on app Main link äóñ http://dl.flipkart.com/dl/bags-wallets-belts/pr... 134 only http://www.flipkart.com/grabbit-men-black-walle...",
"not revert to a Techcircle.in query till the time of filing this report. Rajan has been the mobile business head of Flipkart-controlled lifestyle e-tailer Myntra since June last year. An alumnus of Delhi College of Engineering and IIM Ahmedabad, Rajan is also the co-founder of Easy2commute.com, a carpooling...",
NA)), .Names = c("article_id", "CONTENT"), row.names = c(1299L,
1710L, 1822L, 2371L, 2456L, 2896L), class = "data.frame")
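To illustrate the output I am after, here is a minimal sketch of the table-building step. It assumes that lda@beta holds the logarithm of the word probabilities (topics in rows, terms in columns), which would explain the negative values. The helper name top_terms_with_prob and the toy numbers are made up for illustration; they do not come from my fitted model.

```r
# Hypothetical helper: builds a "top words with probabilities" table from a
# matrix of log word probabilities (topics in rows, terms in columns).
top_terms_with_prob <- function(log_beta, terms, n = 5) {
  prob <- exp(log_beta)  # back-transform logs to probabilities
  cols <- lapply(seq_len(nrow(prob)), function(k) {
    # indices of the n most probable terms for topic k
    top <- order(prob[k, ], decreasing = TRUE)[seq_len(min(n, ncol(prob)))]
    out <- data.frame(terms[top], round(prob[k, top], 2),
                      stringsAsFactors = FALSE)
    names(out) <- c(paste("Topic", k), paste0("Prob.Values.", k))
    out
  })
  do.call(cbind, cols)  # one pair of columns per topic, side by side
}

# Toy input standing in for lda@beta and lda@terms (made-up numbers).
toy_terms <- c("offer", "amazon", "footwear", "flat")
toy_log_beta <- log(rbind(c(0.70, 0.05, 0.10, 0.15),
                          c(0.10, 0.60, 0.20, 0.10)))

# If the values really are log probabilities, each row should sum to 1
# after exponentiating.
print(rowSums(exp(toy_log_beta)))

res <- top_terms_with_prob(toy_log_beta, toy_terms, n = 2)
print(res)
```

If the assumption holds, calling top_terms_with_prob(lda@beta, lda@terms, 5) on the fitted model should give the table above. As far as I can tell, posterior(lda)$terms and posterior(lda)$topics from topicmodels return the same per-topic word probabilities and the per-document topic probabilities directly.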