Bear with me as I am extremely new to this and working on a project for a course in a certificate program.

I have a .csv dataset that I obtained by retrieving bibliometric records from the PubMed and Embase databases. There are 1034 rows and several columns; however, I am trying to create topic models from just one column, the Abstract column, and some records do not have an abstract. I've done some processing (removing stopwords, punctuation, etc.) and have been able to barplot words occurring more than 200 times, create a frequent-term list by rank, and run word associations with selected words. So it seems R is seeing the words themselves in the Abstract field. My issue comes when I try to create topic models using the topicmodels package. Here's the bit of code I'm using.

#including 1st 3 lines for reference
options(header = FALSE, stringsAsFactors = FALSE, FileEncoding = "latin1")
records <- read.csv("Combined.csv")
AbstractCorpus <- Corpus(VectorSource(records$Abstract))

AbstractTDM <- TermDocumentMatrix(AbstractCorpus)
library(topicmodels)
library(lda)
lda <- LDA(AbstractTDM, k = 8)
(term <- terms(lda, 6))
term <- (apply(term, MARGIN = 2, paste, collapse = ","))

However, the output of topics I get is the following.

     Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
[1,] "499"   "733"   "390"   "833"   "17"    "413"   "719"   "392"  
[2,] "484"   "655"   "808"   "412"   "550"   "881"   "721"   "61"   
[3,] "857"   "299"   "878"   "909"   "15"    "258"   "47"    "164"  
[4,] "491"   "672"   "313"   "1028"  "126"   "55"    "375"   "987"  
[5,] "734"   "430"   "405"   "102"   "13"    "193"   "83"    "588"  
[6,] "403"   "52"    "489"   "10"    "598"   "52"    "933"   "980"  

Why am I not seeing words here rather than numbers?

Furthermore, the following code, which I basically took from the R PDF on topicmodels, does produce values for me, but the topics are still numbers rather than words, which is meaningless to me.

#using information from topicmodels paper
library(tm)
library(topicmodels)
library(lda)
AbstractTM <- list(
  VEM = LDA(AbstractTDM, k = 10, control = list(seed = 505)),
  VEM_fixed = LDA(AbstractTDM, k = 10,
                  control = list(estimate.alpha = FALSE, seed = 505)),
  Gibbs = LDA(AbstractTDM, k = 10, method = "Gibbs",
              control = list(seed = 505, burnin = 100, thin = 10, iter = 100)),
  CTM = CTM(AbstractTDM, k = 10,
            control = list(seed = 505, var = list(tol = 10^-4),
                           em = list(tol = 10^-3)))
)
#To compare the fitted models we first investigate the α values of the
#models fitted with VEM and α estimated and with VEM and α fixed

sapply(AbstractTM[1:2], slot, "alpha")

#Find entropy 
sapply(AbstractTM, function(x)mean(apply(posterior(x)$topics, 1, 
function(z) - sum(z * log(z)))))

#Find estimated topics and terms
Topic <- topics(AbstractTM[["VEM"]], 1)
Topic
#find 5 most frequent terms for each topic
Terms <- terms(AbstractTM[["VEM"]], 5)
Terms[,1:5]

Any thoughts on what the issue might be?

SciLibby
  • Please provide [reproducible examples](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) when you're asking a question. – Adam Quek Apr 17 '17 at 02:42
  • It's hard to tell without a reproducible example, but I suspect you are getting the documents in place of the terms. Have you tried using `DocumentTermMatrix()` instead of `TermDocumentMatrix()`? – Kara Woo Apr 17 '17 at 04:40
  • Yes, I have tried that. For some reason, it produces a matrix with zero terms, thus I can't do anything with it. When I attempt to plot word frequencies with the TDM, I get a barplot of numbers instead of terms. What would you need for this to be reproducible? Again, I note I'm new to this so I don't fully understand. Do you mean ALL the code or more than that? – SciLibby Apr 17 '17 at 04:56
  • You can see more info on reproducible examples at the link Adam included, but the most important thing is a sample of the data. That way we can run the exact same code as you and see the results. Can you include the output of `dput(head(records$Abstract, 10))`, or a toy dataset that produces the same problem? – Kara Woo Apr 17 '17 at 05:13
  • Also some rows of the DTM will be all zeroes because some of the records don't have abstracts. These are probably represented in the Abstracts column as empty strings, and since they don't have terms all the term frequencies for these documents are zeroes. I would subset your data to remove the empty abstracts like this: `abstracts <- records$Abstract[records$Abstract != ""]`, then create the corpus and DTM. – Kara Woo Apr 17 '17 at 05:26
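
The subsetting suggested in the last comment can be sketched in base R as follows. This is only an illustration: `abstract_col` is a hypothetical stand-in for `records$Abstract`, and the extra handling of `NA` and whitespace-only entries is an assumption about how missing abstracts might show up after `read.csv()`, not something confirmed in the question.

```r
# Hypothetical stand-in for records$Abstract; the real column was not posted.
abstract_col <- c("Gene expression in tumors.", "", NA, "   ",
                  "Hospital care survey.")

# Keep only entries that are not NA and still contain text after trimming
# whitespace. A plain `abstract_col != ""` comparison would yield NA for the
# missing value and carry it through the subset.
keep <- !is.na(abstract_col) & nzchar(trimws(abstract_col))
abstracts <- abstract_col[keep]

print(abstracts)
```

The filtered `abstracts` vector can then be passed to `VectorSource()` when building the corpus, so that no document ends up with an all-zero row in the DTM.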

1 Answer

Reading the topicmodels documentation, it does appear that the `LDA()` function expects a `DocumentTermMatrix`, not a `TermDocumentMatrix`. Try creating the former with `DocumentTermMatrix(AbstractCorpus)` and see if that works.
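
As a rough illustration of why the matrix orientation matters, here is a minimal sketch on a toy corpus. The four abstracts are made up for the example (the real data was not posted), and `k = 2` and the seed are arbitrary choices. With a `TermDocumentMatrix` the rows are terms and the columns are documents, so `LDA()` treats each document ID as a "term", which would explain the numbers in the output above.

```r
library(tm)
library(topicmodels)

# Hypothetical toy abstracts standing in for records$Abstract.
abstracts <- c(
  "gene expression in tumor cells",
  "tumor cells and gene mutation",
  "patient survey on hospital care",
  "hospital care quality and patient outcomes"
)

corpus <- VCorpus(VectorSource(abstracts))

dtm <- DocumentTermMatrix(corpus)  # rows = documents, columns = terms
tdm <- TermDocumentMatrix(corpus)  # rows = terms, columns = documents

# Fitting on the DTM: the top "terms" per topic are words.
lda_dtm <- LDA(dtm, k = 2, control = list(seed = 505))
terms_dtm <- terms(lda_dtm, 3)
print(terms_dtm)

# Fitting on the TDM: the columns are document IDs, so those are what
# terms() reports.
lda_tdm <- LDA(tdm, k = 2, control = list(seed = 505))
terms_tdm <- terms(lda_tdm, 3)
print(terms_tdm)
```

If the DTM built from the real data comes out with zero terms, it may be worth inspecting the corpus after preprocessing (for example with `inspect(AbstractCorpus[1:3])`) before fitting the model.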

Kara Woo