
I am trying to fetch tweets for a keyword, say "zomato", and to do topic modelling on the fetched tweets. The following is the search function used to fetch the tweets.

library(twitteR)  # provides searchTwitter() and twListToDF()

search <- function(searchterm)
{
  # access tweets and build a cumulative file
  tweets <- searchTwitter(searchterm, n = 25000)
  df <- twListToDF(tweets)
  df <- df[, order(names(df))]
  df$created <- strftime(df$created, '%Y-%m-%d')
  if (!file.exists(paste(searchterm, '_stack.csv')))
    write.csv(df, file = paste(searchterm, '_stack.csv'), row.names = FALSE)

  # merge the latest pull with the cumulative file and remove duplicates
  stack <- read.csv(file = paste(searchterm, '_stack.csv'))
  stack <- rbind(stack, df)
  stack <- subset(stack, !duplicated(stack$text))

  return(stack)
}

ZomatoResults <- search('Zomato')

After this I clean the tweets in the usual way and store the result in the variable "ZomatoCleaned". I haven't included that code, but it is roughly along the lines of the sketch below.
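(A minimal, hypothetical sketch of the cleaning step; clean_tweets is a name used here for illustration only, and my actual code differs in its details:)

clean_tweets <- function(x) {
  x <- gsub("http[^[:space:]]+", " ", x)      # drop URLs
  x <- gsub("@[[:alnum:]_]+", " ", x)         # drop @mentions
  x <- gsub("[^[:alnum:][:space:]]", " ", x)  # drop punctuation and symbols
  x <- tolower(x)
  trimws(gsub("[[:space:]]+", " ", x))        # collapse runs of whitespace
}

ZomatoCleaned <- clean_tweets(ZomatoResults$text)

I then form the corpus and do the topic modelling as shown below.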

library(tm)            # Corpus, tm_map, DocumentTermMatrix
library(topicmodels)   # LDA, CTM
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

options(mc.cores = 1)
tm_parLapply_engine(parallel::mclapply)

corpus <- Corpus(VectorSource(ZomatoCleaned))  # create corpus object
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)

pal <- brewer.pal(8, "Dark2")
dev.new(width = 1000, height = 1000, unit = "px")
wordcloud(corpus, min.freq = 2, max.words = 100, random.order = TRUE, col = pal)

dat <- DocumentTermMatrix(corpus)
dput(head(dat))

# drop empty documents before fitting the models
doc.lengths <- rowSums(as.matrix(DocumentTermMatrix(corpus)))
dtm <- DocumentTermMatrix(corpus[doc.lengths > 0])
# model <- LDA(dtm, 10)  # a simple model for a quick sanity check

SEED <- sample(1:1000000, 1)  # record this seed if you want to reproduce a run
k <- 10                       # start with 10 topics

models <- list(
  CTM       = CTM(dtm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3))),
  VEM       = LDA(dtm, k = k, control = list(seed = SEED)),
  VEM_Fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs     = LDA(dtm, k = k, method = "Gibbs",
                  control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000))
)

lapply(models, terms, 10)
assignments <- sapply(models, topics)
head(assignments, n = 10)

Unfortunately, the line

doc.lengths <- rowSums(as.matrix(DocumentTermMatrix(corpus)))

fails with the error "vector size specified is too large" or "cannot allocate vector of size 36.6 Gb". I am on an 8 GB RAM system running R 3.5.2 in RStudio. I have run gc() and tried increasing memory.limit(), but neither helped. Is there a workaround to deal with this dataset? I know it is a memory issue, but please advise on how to tackle this scenario.
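One direction I am wondering about (an untested sketch): tm stores a DocumentTermMatrix as a sparse simple_triplet_matrix from the slam package, so perhaps the document lengths can be computed without ever calling as.matrix():

library(slam)  # tm's sparse-matrix backend

doc.lengths <- slam::row_sums(dat)  # row sums on the sparse matrix; no dense copy
dtm <- dat[doc.lengths > 0, ]       # drop empty documents, still sparse

If that is valid, it would avoid allocating the dense 36.6 Gb matrix entirely.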

[Image: Topic Modelling Error Zomato]

Output of dput(head(dat)):

structure(c(0, 1, 0, 0, 0, 0), weighting = c("term frequency", "tf"), class = c("DocumentTermMatrix", "simple_triplet_matrix"))


  • Are you using a command line interface or an IDE like RStudio? Either way, this post should help, but the solutions are different: https://stackoverflow.com/questions/51295402/r-on-macos-error-vector-memory-exhausted-limit-reached/52612921#52612921 – Graeme Frost Apr 02 '19 at 14:04
  • I would suggest working with the dataset in chunks. – Slayer Apr 02 '19 at 14:20
  • It is because whatever `DocumentTermMatrix(corpus)` produces is too large to load for your computer's physical + virtual memory combined for matrix conversion. Can you check the class of this object? Why do you need to convert it to matrix because that is one big memory intensive step. Do you anticipate speed gain by converting it to matrix? There might be other ways to deal with it if we can see some of your data. – cropgen Apr 02 '19 at 14:34
  • @GraemeFrost I am using R 3.5.2 in RStudio, not the CLI. – Karan Kalra Apr 02 '19 at 15:20
  • @Prerit I thought of that too, but it's my last resort. I wanted to see if something can be done with this dataset in one go. – Karan Kalra Apr 02 '19 at 15:20
  • @nsinghs What other ways could be possible? I just need to get the topics from the dataset of tweets, nothing more. I am doing this for the first time, so I am not sure of the possible alternatives. If you can suggest some, please help me with the same. – Karan Kalra Apr 02 '19 at 15:20
  • @KaranKalra in that case, look at the answer I posted on the thread I linked in my previous comment, hope it helps! – Graeme Frost Apr 02 '19 at 15:29
  • @KaranKalra, for that I would need to see what `DocumentTermMatrix(corpus)` looks like. Can you assign it to another variable using `dat <- DocumentTermMatrix(corpus)`, then run `dput(head(dat))` and paste the output of that command in your question? I am pretty sure it is a 2-D dataframe or so, but just want to make sure. Then we can try the `apply` function, which might not be the fastest but will get the job done without running out of memory. – cropgen Apr 02 '19 at 18:48
  • @nsinghs Have added the same and shared the results – Karan Kalra Apr 02 '19 at 20:32
  • @GraemeFrost I tried, but it didn't work; that answer is specific to macOS. I am on Windows 10 and have tried the solutions mentioned in other questions too. – Karan Kalra Apr 02 '19 at 20:34
  • @KaranKalra that is not informative at all. I expected it to be a 2-D data frame, but it is not clear from the output you provided. Anyway, if you think the `rowSums()` would have worked there if memory was not an issue, you can try `apply()` or `Reduce` to get the row sums. `apply` and `Reduce` are efficient and do not require much memory. – cropgen Apr 02 '19 at 21:04
  • @KaranKalra, ah, well I'm sorry I couldn't help – Graeme Frost Apr 02 '19 at 21:10
