I am trying to fetch tweets for a keyword, let's say "zomato", and to do topic modelling on the fetched tweets. The following is the search function I use to fetch the tweets:
library(twitteR)

search <- function(searchterm)
{
  # access tweets and create cumulative file
  tweets <- searchTwitter(searchterm, n = 25000)  # 'tweets' rather than 'list', to avoid masking base::list
  df <- twListToDF(tweets)
  df <- df[, order(names(df))]
  df$created <- strftime(df$created, '%Y-%m-%d')
  # note: paste() inserts a space, so the file is e.g. "Zomato _stack.csv";
  # it is used consistently below, so reads and writes still match
  if (!file.exists(paste(searchterm, '_stack.csv'))) write.csv(df, file = paste(searchterm, '_stack.csv'), row.names = FALSE)
  # merge last access with cumulative file and remove duplicates
  stack <- read.csv(file = paste(searchterm, '_stack.csv'))
  stack <- rbind(stack, df)
  stack <- subset(stack, !duplicated(stack$text))
  return(stack)
}
ZomatoResults <- search('Zomato')
After this I clean the tweets in the usual way and store the result in the variable ZomatoCleaned. I haven't included that exact code, but a simplified sketch of it is below for reference, followed by the corpus construction and topic modelling.
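The cleaning is roughly along these lines (a simplified sketch, not the exact code I run; the clean_tweets helper name and the exact substitutions here are just illustrative):

# illustrative sketch of the omitted cleaning step, not the real code
clean_tweets <- function(x) {
  x <- gsub("http\\S+", "", x)                # drop URLs
  x <- gsub("@\\w+", "", x)                   # drop @mentions
  x <- gsub("[^[:alnum:][:space:]]", " ", x)  # strip punctuation/emoji
  x <- tolower(x)
  trimws(gsub("\\s+", " ", x))                # collapse whitespace
}
ZomatoCleaned <- clean_tweets(ZomatoResults$text)

The corpus construction and modelling then look like this: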
library(tm)
library(SnowballC)      # needed by stemDocument
library(wordcloud)
library(RColorBrewer)
library(topicmodels)

options(mc.cores = 1) # or whatever
tm_parLapply_engine(parallel::mclapply)

corpus <- Corpus(VectorSource(ZomatoCleaned)) # create corpus object
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)

pal <- brewer.pal(8, "Dark2")
dev.new(width = 1000, height = 1000, unit = "px")
wordcloud(corpus, min.freq = 2, max.words = 100, random.order = TRUE, col = pal)

dat <- DocumentTermMatrix(corpus)
dput(head(dat))
doc.lengths <- rowSums(as.matrix(DocumentTermMatrix(corpus))) # <- this is the line that fails (see below)
dtm <- DocumentTermMatrix(corpus[doc.lengths > 0])

# model <- LDA(dtm, 10) # go ahead and test a simple model if you want
SEED <- sample(1:1000000, 1) # pick a random seed for replication
k <- 10 # let's start with 10 topics

models <- list(
  CTM       = CTM(dtm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3))),
  VEM       = LDA(dtm, k = k, control = list(seed = SEED)),
  VEM_Fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs     = LDA(dtm, k = k, method = "Gibbs",
                  control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000))
)

lapply(models, terms, 10)
assignments <- sapply(models, topics)
head(assignments, n = 10)
Unfortunately, at the line

doc.lengths <- rowSums(as.matrix(DocumentTermMatrix(corpus)))

I get the error "vector size specified is too large" or "cannot allocate vector of size 36.6 Gb". I am on an 8 GB RAM machine running R 3.5.2 (via RStudio). I have run gc() and tried raising memory.limit(), but neither helps. I know it is a memory issue: as.matrix() expands the sparse document-term matrix into a dense one, and with ~25,000 tweets and (presumably) a vocabulary of roughly 200,000 terms, that is about 25,000 × 196,000 × 8 bytes ≈ 36.6 Gb, which can never fit in 8 GB. Is there some workaround to deal with this dataset?
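Would computing the document lengths on the sparse representation avoid this? Something like the sketch below is what I have in mind; slam is the sparse-matrix package that tm builds on, but I haven't verified this on the full dataset:

library(slam)

# idea: compute document lengths directly on the sparse
# simple_triplet_matrix underlying the DTM, so the ~36 Gb
# dense matrix from as.matrix() is never materialised
dat <- DocumentTermMatrix(corpus)
doc.lengths <- slam::row_sums(dat)
dtm <- dat[doc.lengths > 0, ]  # subset the DTM itself rather than rebuilding it from the corpus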
Output of dput(head(dat)):

structure(c(0, 1, 0, 0, 0, 0), weighting = c("term frequency", "tf"), class = c("DocumentTermMatrix", "simple_triplet_matrix"))