
Is it possible to check how many documents remain in the corpus after applying prune_vocabulary in the text2vec package?

Here is an example that reads in a dataset and prunes the vocabulary:

library(text2vec)
library(data.table)
library(tm)

#Load movie review dataset
data("movie_review")
setDT(movie_review)
setkey(movie_review, id)
set.seed(2016L)

#Tokenize
prep_fun = tolower
tok_fun = word_tokenizer
it_train = itoken(movie_review$review,
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = movie_review$id,
                  progressbar = FALSE)


#Generate vocabulary
vocab = create_vocabulary(it_train, stopwords = tm::stopwords())

#Prune vocabulary
#How do I ascertain how many documents got kicked out of my training set because of the pruning criteria?
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10,
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)

# create document term matrix with new pruned vocabulary vectorizer
vectorizer = vocab_vectorizer(pruned_vocab)
dtm_train  = create_dtm(it_train, vectorizer)

Is there an easy way to understand how aggressively the term_count_min and doc_proportion_min parameters act on my text corpus? I am trying to do something similar to the stm package's plotRemoved function, which produces a plot like this:

[plot: stm::plotRemoved output showing documents, words, and tokens removed across thresholds]
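
For reference, here is the kind of sweep I have in mind, done by hand with prune_vocabulary (a minimal sketch assuming the vocabulary object keeps its per-term statistics in a $vocab data.table; the threshold values are arbitrary):

#Sweep term_count_min and record how many terms survive each threshold
thresholds = c(1, 2, 5, 10, 20, 50)
terms_left = sapply(thresholds, function(k)
  nrow(prune_vocabulary(vocab, term_count_min = k)$vocab))
plot(thresholds, terms_left, type = "b", log = "x",
     xlab = "term_count_min", ylab = "terms remaining")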

    It's easier to help you if you provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used for testing and verification. – MrFlick Mar 06 '17 at 19:38
  • I have added the MWE. thanks! – sriramn Mar 07 '17 at 00:19

1 Answer


The vocabulary object (`v` in the snippet below) contains a `$vocab` element, a data.table with many statistics about your corpus. `prune_vocabulary` with the `term_count_min` and `doc_proportion_min` parameters just filters this data.table. For example, here is how you can calculate the number of removed tokens:

total_tokens = sum(v$vocab$terms_counts)
total_tokens
# 1230342
# now let's prune
v2 = prune_vocabulary(v, term_count_min = 10)
total_tokens - sum(v2$vocab$terms_counts)
# 78037
# effectively this removes 78037 tokens
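
The same comparison works at the term level rather than the token level; a one-line check under the same API:

# unique terms dropped by the same pruning step
nrow(v$vocab) - nrow(v2$vocab)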

Alternatively, you can create document-term matrices with different vocabularies and inspect them with functions from the Matrix package: colMeans(), colSums(), rowMeans(), rowSums(), etc. I'm sure you can obtain any of the metrics above with these.

For example, here is how to find empty documents:

# tokens per document = row sums of the document-term matrix
doc_word_count = Matrix::rowSums(dtm)
indices_empty_docs = which(doc_word_count == 0)
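
To get the number the question actually asks for, count the non-empty rows; a minimal continuation of the snippet above:

# documents that still contain at least one vocabulary term
n_docs_remaining = sum(doc_word_count > 0)
# equivalently: nrow(dtm) - length(indices_empty_docs)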
  • This is very helpful. I was able to write a wrapper around the `prune_vocabulary` function to generate the token and word plots. Only the documents one is a little confusing. Can you please elaborate on how we can calculate how many documents remain in the corpus? One thing I noticed is that the LDA implementation applies weights even for empty documents. I wonder if that can be avoided so I can manually remove empty documents before running LDA using this approach. – sriramn Mar 08 '17 at 21:40
  • Added an example of how to find empty docs. For LDA this is probably an old [bug](https://github.com/dselivanov/text2vec/issues/149) that was fixed in the development version. – Dmitriy Selivanov Mar 09 '17 at 05:45