
I am going to calculate the similarity between almost 14 thousand documents, but my code is taking too much time to execute. Is there any way to do the same work faster?

Here is my code

library(openxlsx) #createWorkbook, addWorksheet, writeData, saveWorkbook
library(tm)       #Corpus, VectorSource, tm_map, DocumentTermMatrix
library(proxy)    #dist() with "cosine" and "jaccard" methods

wb=createWorkbook() #create workbook
addWorksheet(wb,"absSim") #create worksheet
listoffiles=list.files() #get list of documents from current working directory
fileslength=length(listoffiles) #no of documents in directory
rownum=1 #worksheet row to write the next result to
for(i in 1:(fileslength-1))
{
  d1=readLines(listoffiles[i])# read first document 
  k=i+1
  for(j in k:fileslength)
  {
   d2=readLines(listoffiles[j]) #read second document
   #make a vector of two documents
   myvector=c(d1,d2)
   #making corpus of two documents
   mycorpus=Corpus(VectorSource(myvector))
   #preprocessing of corpus
   mycorpus=tm_map(mycorpus,removePunctuation)
   mycorpus=tm_map(mycorpus,removeNumbers)
   mycorpus=tm_map(mycorpus,stripWhitespace)
   mycorpus=tm_map(mycorpus,content_transformer(tolower))
   mycorpus=tm_map(mycorpus,function(x) removeWords(x,stopwords("english")))
   mycorpus=tm_map(mycorpus,function(x) removeWords(x,"x"))
   #make a document term matrix now
   dtm=as.matrix(DocumentTermMatrix(mycorpus))
   #compute distance of both documents using proxy package
   cdist=as.matrix(dist(dtm,method = "cosine"))
   jdist=as.matrix(dist(dtm,method = "jaccard"))
   #compute similarity
   csim=1-cdist
   jsim=1-jdist
   #get similarity of both documents
   cos=csim[1,2]
   jac=jsim[1,2]
   if(cos>0 | jac>0)
   {
     writeData(wb,"absSim",cos,startCol = 1,startRow = rownum)
     writeData(wb,"absSim",jac,startCol = 2,startRow = rownum)
     saveWorkbook(wb,"abstractSimilarity.xlsx",overwrite = TRUE)
     rownum=rownum+1
   }
  }
}

When I run this code, processing just the first document against all the others takes about 2 hours. Is there any way to calculate cosine and jaccard similarity faster?

  • Have a look at the package `text2vec`, which is currently the fastest for this kind of task, at least in my experience. Several good tutorials are available at [text2vec.org](http://text2vec.org/). Furthermore, your way of reading in files seems a bit complicated to me. You might consider using `list.files()` and reading in all documents at once, e.g. via `lapply`, and then constructing the corpus and dtm from this object (see the sketch after these comments). Does that help? – Manuel Bickel Nov 23 '17 at 10:05
  • I have a huge collection of documents, and it will require more space, so making a dtm is very complex for a large collection. – Alvi Nov 23 '17 at 11:45
  • What does huge mean for you? How many docs, how many terms per doc on average? – Manuel Bickel Nov 23 '17 at 11:47
  • Sorry, just saw it's 14,000. – Manuel Bickel Nov 23 '17 at 11:48
  • I have 14,000 documents. Every document contains roughly 80-plus words, and I want to calculate the similarity between every pair of documents. Is there any code example? – Alvi Nov 23 '17 at 11:57
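
A minimal sketch of the read-in approach suggested in the comments, assuming the documents are plain-text files in the current working directory (the use of `readLines` and `tm` mirrors the question's code; file encoding is not handled here):

library(tm)

#read all documents once instead of re-reading them inside nested loops
listoffiles <- list.files()
texts <- unlist(lapply(listoffiles, function(f) {
  paste(readLines(f, warn = FALSE), collapse = " ") #one string per document
}))

#build a single corpus and document-term matrix from all documents at once
mycorpus <- Corpus(VectorSource(texts))
dtm <- DocumentTermMatrix(mycorpus)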

1 Answer


You might try the following code. It is a very simplified version without any cleaning or pruning, just to demonstrate how to use text2vec. I have also used the tokenizers package for tokenization, since it's a bit faster than the tokenizer in text2vec. I used the sampling function provided by Zach for this question/answer to generate example documents. On my machine the code completes in less than a minute. Of course, other similarity measures or integration of pre-processing are possible. I hope this is what you are looking for.

library(text2vec)
library(tokenizers)

samplefun <- function(n, x, collapse){
  paste(sample(x, n, replace=TRUE), collapse=collapse)
}

words <- sapply(rpois(10000, 8) + 1, samplefun, letters, '')

#14000 documents, each made of 100 short sentences (pasted together) of several words
docs <- sapply(1:14000, function(x) {

  paste(sapply(rpois(100, 5) + 1, samplefun, words, ' '), collapse = ". ")

})

iterator <- itoken(docs
                   ,tokenizer = function(x) tokenizers::tokenize_words(x, lowercase = FALSE)
                   ,progressbar = FALSE
                   )

vocabulary <- create_vocabulary(iterator)

dtm <- create_dtm(iterator, vocab_vectorizer(vocabulary))

#dtm
#14000 x 10000 sparse Matrix of class "dgCMatrix"
#....

#use, e.g., the first and second half of the dtm as document sets
similarity <- sim2(dtm[1:(nrow(dtm)/2),]
                   , dtm[(nrow(dtm)/2+1):nrow(dtm),]
                   , method = "jaccard"
                   , norm = "none")

dim(similarity)
#[1] 7000 7000
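
Cosine similarity works the same way; a minimal sketch along the lines above (here `norm = "l2"` is used on the raw count dtm, and passing a single matrix to `sim2` gives all-pairs similarities):

#cosine similarity between the same two document sets
cos_similarity <- sim2(dtm[1:(nrow(dtm)/2),]
                       , dtm[(nrow(dtm)/2+1):nrow(dtm),]
                       , method = "cosine"
                       , norm = "l2")

#all-pairs cosine similarity of all 14000 documents as one sparse matrix
#cos_all <- sim2(dtm, method = "cosine", norm = "l2")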
Manuel Bickel