I have a large number of documents and I want to do topic modelling using text2vec and LDA (Gibbs Sampling).
The steps I need, in order, are:

Step 1: Remove numbers and symbols from the text
```r
library(stringr)

docs$text <- stringr::str_replace_all(docs$text, "[^[:alpha:]]", " ")
docs$text <- stringr::str_replace_all(docs$text, "\\s+", " ")
```
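For instance (a toy input of my own, not from my real data), the two substitutions behave like this:

```r
# Toy example: non-letters become spaces, then runs of whitespace collapse.
x <- "Sales grew 12% in Q3!"
x <- stringr::str_replace_all(x, "[^[:alpha:]]", " ")
x <- stringr::str_replace_all(x, "\\s+", " ")
x
#> [1] "Sales grew in Q "
```

Note that digits are dropped entirely (so "Q3" becomes "Q"), which is what I want here.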
Step 2: Remove stop words
```r
library(text2vec)
library(tm)

stopwords <- c(tm::stopwords("english"), custom_stopwords)
prep_fun  <- tolower
tok_fun   <- word_tokenizer

tokens <- docs$text %>% prep_fun %>% tok_fun
it <- itoken(tokens, ids = docs$id, progressbar = FALSE)
v <- create_vocabulary(it, stopwords = stopwords) %>%
  prune_vocabulary(term_count_min = 10)
vectorizer <- vocab_vectorizer(v)
```
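To check the result (a quick sanity check, nothing more):

```r
# The vocabulary should no longer contain stop words, and every remaining
# term should occur at least 10 times (term_count_min = 10).
head(v, 10)
```

As I understand it, text2vec removes stop words from the vocabulary rather than from the tokens, so the corresponding columns simply never appear in the DTM.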
Step 3: Replace synonyms with main terms
I have an Excel file in which the first column holds the main word and its synonyms are listed in the second, third, ... columns. I want to replace all synonyms with the main word (column #1); each term can have a different number of synonyms. Here is example code using the "tm" package (but I am interested in a text2vec version):
```r
replaceSynonyms <- content_transformer(function(x, syn = NULL) {
  # For each synonym group, replace every synonym with its main word
  Reduce(function(a, b) {
    gsub(paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b"),
         b$word, a, perl = TRUE)
  }, syn, x)
})

# Turn each row of the Excel data into list(word = main word, syns = synonyms)
l <- lapply(as.data.frame(t(Synonyms), stringsAsFactors = FALSE),
            function(x) {
              x <- unname(x)
              list(word = x[1], syns = x[-1])
            })
names(l) <- paste0("list", Synonyms[, 1])

synonyms <- unname(l)  # plain list of word/synonym groups

MyCorpus <- tm_map(MyCorpus, replaceSynonyms, synonyms)
```
Step 4: Convert to a document-term matrix
```r
dtm <- create_dtm(it, vectorizer)
```
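A quick check of the result: dtm should be a sparse matrix with one row per document id and one column per term in the pruned vocabulary.

```r
dim(dtm)  # n_documents x n_terms
```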
Step 5: Apply the LDA model to the document-term matrix
```r
doc_topic_prior <- 0.1  # can this be chosen based on the data?
lda_model <- LDA$new(n_topics = 10,
                     doc_topic_prior = doc_topic_prior,
                     topic_word_prior = 0.01)
doc_topic_distr <- lda_model$fit_transform(dtm,
                                           n_iter = 1000,
                                           convergence_tol = 0.01,
                                           n_check_convergence = 10)
```
MyCorpus in Step 3 is the corpus obtained using the "tm" package. Steps 2 and 3 do not work together, because the output of Step 2 is a vocabulary while the input for Step 3 is a "tm" corpus.
My first question here is: how can I do all of the steps using the text2vec package (and compatible packages)? I found it very efficient; thanks to Dmitriy Selivanov.
Second: how do we set optimal values for the LDA parameters in Step 5? Is it possible to set them automatically based on the data?
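For the second question, here is the kind of search I have in mind (a sketch, not a definitive recipe): common rules of thumb set doc_topic_prior = 50 / n_topics and topic_word_prior around 0.1 (or 1 / n_topics), and one can compare a small grid of settings by perplexity (ideally on held-out documents rather than the training DTM as below):

```r
# Sketch: grid over n_topics with rule-of-thumb priors, scored by perplexity.
for (k in c(10, 20, 30)) {
  model <- LDA$new(n_topics = k,
                   doc_topic_prior = 50 / k,
                   topic_word_prior = 0.1)
  dtd <- model$fit_transform(dtm, n_iter = 1000,
                             convergence_tol = 0.01,
                             n_check_convergence = 10,
                             progressbar = FALSE)
  cat("n_topics =", k, "perplexity =",
      perplexity(dtm, model$topic_word_distribution, dtd), "\n")
}
```

Is this a reasonable approach, or does text2vec provide something more automatic?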
Thanks to Manuel Bickel for corrections in my post.
Thanks, Sam