
I have a large number of documents and I want to do topic modelling using text2vec and LDA (Gibbs Sampling).

The steps I need are (in order):

  1. Removing numbers and symbols from the text

    library(stringr)
    docs$text <- stringr::str_replace_all(docs$text,"[^[:alpha:]]", " ")
    docs$text <- stringr::str_replace_all(docs$text,"\\s+", " ")
    
  2. Removing stop words

    library(text2vec)        
    library(tm)
    
    stopwords <- c(tm::stopwords("english"), custom_stopwords)
    
    prep_fun <- tolower
    tok_fun <- word_tokenizer
    tokens <- docs$text %>%
             prep_fun %>% 
             tok_fun
    it <- itoken(tokens, 
                ids = docs$id,
                progressbar = FALSE)
    
    v <- create_vocabulary(it, stopwords = stopwords) %>% 
        prune_vocabulary(term_count_min = 10)
    
    vectorizer <- vocab_vectorizer(v)
    
  3. Replacing synonyms with main terms

I have an Excel file in which the first column is the main word and its synonyms are listed in the second, third, and further columns. I want to replace all synonyms with their main word (column #1). Each term can have a different number of synonyms. Here is an example of code using the "tm" package (but I am interested in an equivalent within the text2vec workflow); a rough sketch of the kind of input handling I have in mind follows the "tm" code below:

    replaceSynonyms <- content_transformer(function(x, syn = NULL) {
      Reduce(function(a, b) {
        gsub(paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b"),
             b$word, a, perl = TRUE)
      }, syn, x)
    })
    
    l <- lapply(as.data.frame(t(Synonyms), stringsAsFactors = FALSE),
                function(x) {
                  x <- unname(x)
                  list(word = x[1], syns = x[-1])
                })
    names(l) <- paste0("list", Synonyms[, 1])
    list2env(l, envir = .GlobalEnv)
    
    synonyms <- list()
    for (i in 1:length(names(l))) synonyms[i] <- l[i]
    
    MyCorpus <- tm_map(MyCorpus, replaceSynonyms, synonyms)
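
To give a rough idea of what I am after (only a sketch, untested; the readxl package, the file name "synonyms.xlsx", and the column layout are placeholders, not my actual setup): read the wide Excel table, reshape it into (main word, synonym) pairs, and replace each synonym directly in docs$text before tokenization.

    library(readxl)
    library(reshape2)
    
    #first column = main word, remaining columns = synonyms (empty cells become NA)
    syn_wide <- as.data.frame(readxl::read_excel("synonyms.xlsx"))  #placeholder file name
    syn_long <- reshape2::melt(syn_wide,
                               id.vars = names(syn_wide)[1],
                               value.name = "synonym",
                               na.rm = TRUE)
    
    #replace each synonym (possibly a multi-word phrase) by its main word;
    #note: synonyms containing regex metacharacters would need escaping
    for (i in seq_len(nrow(syn_long))) {
      docs$text <- gsub(paste0("\\b", syn_long$synonym[i], "\\b"),
                        syn_long[[1]][i],
                        docs$text, perl = TRUE)
    }
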
  4. Convert to document term matrix

    dtm  <- create_dtm(it, vectorizer)
    
  5. Apply LDA model on document term matrix

    doc_topic_prior <- 0.1  # can be chosen based on data? 
    lda_model <- LDA$new(n_topics = 10, 
              doc_topic_prior = doc_topic_prior, topic_word_prior = 0.01)
    doc_topic_distr <- lda_model$fit_transform(dtm, n_iter = 1000, 
              convergence_tol = 0.01, check_convergence_every_n = 10)
    

MyCorpus in Step 3 is the corpus obtained using the "tm" package. Step 2 and Step 3 do not work together, as the output of Step 2 is a vocabulary while the input for Step 3 is a "tm" corpus.

My first question is how I can do all of these steps using the text2vec package (and compatible packages), as I have found it very efficient; thanks to Dmitriy Selivanov.

Second: how do we set optimal values for the LDA parameters in Step 5? Is it possible to set them automatically based on the data?

Thanks to Manuel Bickel for corrections in my post.

Thanks, Sam


1 Answer


Updated answer in response to your comment:

First question: The issue of synonym replacement has already been answered here: Replace words in text2vec efficiently. Check the answer by count in particular. Patterns and replacements may be ngrams (multi-word phrases). Please note that the second answer, by Dmitriy Selivanov, uses word_tokenizer() and does not cover the case of ngram replacement in the form presented.
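
To make this point concrete, here is a quick illustration (nothing specific to your data): word_tokenizer() splits a phrase into single tokens, so a replacement table keyed on the full multi-word synonym never matches any token.

library(text2vec)

word_tokenizer("the coffee is boiling like lava")
#returns a list with one character vector:
#"the" "coffee" "is" "boiling" "like" "lava"
#the multi-word synonym "boiling like lava" no longer exists as a single token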

Is there any reason why you need to replace synonyms before stopword removal? Usually this order should not cause problems; or do you have an example in which switching the order produces significantly different results? If you really want to replace synonyms after stopword removal, I guess that you would have to apply such changes to the dtm when using text2vec. If you do so, you need to allow ngrams in your dtm with a minimum ngram length as included in your synonyms. I have provided a workaround in the code below as one option. Please note that allowing higher ngrams in your dtm produces noise that may or may not influence your downstream tasks (you can probably prune most of the noise in the vocabulary step). Therefore, replacing ngrams earlier seems to be the better solution.

Second question: You might check the textmineR package (and its source code), which helps you to select the best number of topics, or also the answer to this question: Topic models: cross validation with loglikelihood or perplexity. Regarding the handling of priors, I have not yet figured out how different packages, e.g., text2vec (WarpLDA algorithm), lda (Collapsed Gibbs Sampling algorithm and others), or topicmodels ('standard' Gibbs Sampling and Variational Expectation-Maximization algorithm), handle these values in detail. As a starting point, you might have a look at the detailed documentation of topicmodels, where chapter "2.2. Estimation" tells you how the alpha and beta parameters defined in "2.1. Model specification" are estimated.
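
As a rough, untested sketch of the perplexity route with text2vec itself: fit models for a few candidate topic numbers on a training split and compare their perplexity on held-out documents. perplexity(), $transform(), and the $topic_word_distribution field are part of text2vec, but argument names may differ slightly between versions, and doc_topic_prior = 50 / k is only the common Griffiths/Steyvers rule of thumb, not a recommendation.

library(text2vec)

set.seed(42)
train_idx <- sample(nrow(dtm), floor(0.8 * nrow(dtm)))
dtm_train <- dtm[train_idx, ]
dtm_test  <- dtm[-train_idx, ]

candidate_k <- c(5, 10, 20, 40)
perp <- sapply(candidate_k, function(k) {
  lda <- LDA$new(n_topics = k, doc_topic_prior = 50 / k, topic_word_prior = 0.01)
  lda$fit_transform(dtm_train, n_iter = 1000, convergence_tol = 0.001)
  #doc-topic distribution for the held-out documents
  theta_test <- lda$transform(dtm_test)
  perplexity(dtm_test,
             topic_word_distribution = lda$topic_word_distribution,
             doc_topic_distribution = theta_test)
})
names(perp) <- candidate_k
perp  #lower is better; pick k near the minimum or the "elbow"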

For the purpose of learning, please note that your code produced errors at two points, which I have revised: (1) you need to use the correct name for stopwords in create_vocabulary(), stopwords instead of stop_words, since you defined the name as such; (2) you do not need vocabulary = ... in your LDA model definition - maybe you are using an older version of text2vec?

library(text2vec) 
library(reshape2)
library(stringi)

#function proposed by @count
mgsub <- function(pattern, replacement, x) {
  if (length(pattern) != length(replacement)) {
    stop("Pattern not equal to Replacement")
  } 
  for (v in 1:length(pattern)) {
    x <- gsub(pattern[v], replacement[v], x, perl = TRUE)
  }
  return(x)
}

docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is boiling like lava",
          "the coffee is frozen",
          "the coffee is perfect",
          "the coffee is warm almost hot"
)

synonyms <- data.frame(mainword = c("warm", "cold")
                       ,syn1 = c("hot", "frozen")
                       ,syn2 = c("boiling like lava", "")
                       ,stringsAsFactors = FALSE)

synonyms[synonyms == ""] <- NA

synonyms <- reshape2::melt(synonyms
                           ,id.vars = "mainword"
                           ,value.name = "synonym"
                           ,na.rm = TRUE)

synonyms <- synonyms[, c("mainword", "synonym")]


prep_fun <- tolower
tok_fun <- word_tokenizer
tokens <- docs %>% 
  #here is where you might replace synonyms directly in the docs
  #{ mgsub(synonyms[,"synonym"], synonyms[,"mainword"], . ) } %>%
  prep_fun %>% 
  tok_fun
it <- itoken(tokens, 
             progressbar = FALSE)

v <- create_vocabulary(it,
                       sep_ngram = "_",
                       ngram = c(ngram_min = 1L
                                 #allow for ngrams in dtm
                                 ,ngram_max = max(stri_count_fixed(unlist(synonyms), " "))
                                 )
)

vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(it, vectorizer)

#ngrams in dtm
colnames(dtm)

#ensure that ngrams in synonym replacement table have the same format as ngrams in dtm
synonyms <- apply(synonyms, 2, function(x) gsub(" ", "_", x))

colnames(dtm) <- mgsub(synonyms[,"synonym"], synonyms[,"mainword"], colnames(dtm))


#only zeros/ones in dtm since none of the docs specified in my example
#contains duplicate terms
dim(dtm)
#7 24
max(dtm)
#1

#workaround to aggregate colnames in dtm
#I think there is no function `colsum` that allows grouping
#therefore, a workaround based on rowsum
#not elegant because you have to transpose two times, 
#convert to matrix and reconvert to sparse matrix
dtm <- 
  Matrix::Matrix(
    t(
      rowsum(t(as.matrix(dtm)), group = colnames(dtm))
    )
    , sparse = T)


#synonyms in columns replaced
dim(dtm)
#7 20
max(dtm)
#2
Manuel Bickel
  • Thanks so much for your answer. Actually, I have large data with lots of misspellings and abbreviations, and also different abbreviations for the same word. The main word is just one word, but synonyms can be phrases such as "hot water". I need to first remove stopwords (Step 2 in my question) and then replace multiple synonyms with the main word. How can I do those two steps in order, i.e. first removing stop words, then replacing synonyms? I did all my work using the "tm" and "topicmodels" packages, but they are very slow and I want to switch to text2vec. – Sam S. Oct 22 '17 at 11:22
  • I realized that part of your question has been answered already elsewhere. I have updated my answer accordingly and included a link to that answer. – Manuel Bickel Oct 23 '17 at 12:15
  • Thanks Manuel for the update. Removing some stopwords before building the ngrams makes it easier for me to focus on important ngrams/phrases. For example, "return to work", "return at work", "return work" are all replaced with returnwork. I have lots of phrases of this type. – Sam S. Oct 25 '17 at 05:29
  • 1
    I can use your code to remove stopwords before replacing synonyms; great code. I added {mgsub(stopwords[,"stopword"], stopwords[,"spacereplace"], . )} %>% exactly before synonyms replacement in your function. Here "stopword" is my stopword and spacereplace = " "; I used exactly in the same way you used for synonyms. Also for the entire word replacement I used x <- gsub(paste0("(?<!\\p{L})(?:", paste(pattern[v], collapse = "|"), ")(?!\\p{L})"),replacement[v],x, perl = TRUE) in your function; it prevent creating some unwanted words if you replace something like "r" with "right". – Sam S. Oct 25 '17 at 05:39
  • I am glad that I could help. Thanks for your clarification regarding the order of replacement. Your order makes sense, of course. Please consider accepting my answer if I could solve your issues; this would mark this thread as completed. Thanks and good luck with your project. – Manuel Bickel Oct 25 '17 at 06:57
  • Forgot something: Your approach of stopword removal in the tokenizer/prep_fun does not produce the noise I pointed out, but it does not use the speed advantage of `text2vec`. Neither approach is better per se; I just wanted to point out that one has to choose between speed and accuracy depending on the amount of text to be processed. Of course, if you have ngrams that include stopwords, I would opt for your approach if speed is not an issue, due to its accuracy. In my answer I just forced the use of `text2vec` stopword removal to demonstrate the approach. Hope it helped ;-) – Manuel Bickel Oct 25 '17 at 07:15