My task is to compare documents in a corpus by cosine similarity. I use the tm package and obtain a TermDocumentMatrix (in tf-idf form), tdm. The task should then be as simple as stated here:
d <- dist(tdm, method="cosine")
or
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
However, the number of terms in my tdm is quite large: more than 120,000 (with around 50,000 documents). Handling such a matrix is beyond what R can do on my machine, and RStudio crashed several times.
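To put rough numbers on this, here is a back-of-the-envelope estimate. It assumes dense storage at 8 bytes per double and uses the rounded sizes quoted above, so the figures are approximate.

```r
# Rough memory estimate, assuming dense double-precision storage (8 bytes per cell).
n_terms <- 120000  # approximate number of terms
n_docs  <- 50000   # approximate number of documents

# Size in GB (10^9 bytes) of the tdm if it were converted with as.matrix()
dense_tdm_gb <- n_terms * n_docs * 8 / 1e9

# Size in GB of a dense documents-by-documents similarity matrix
doc_sim_gb <- n_docs^2 * 8 / 1e9

# Size in GB of a dense 120,000 x 120,000 result
term_sim_gb <- n_terms^2 * 8 / 1e9

c(dense_tdm = dense_tdm_gb, doc_sim = doc_sim_gb, term_sim = term_sim_gb)
```

Under these assumptions the dense tdm alone would need about 48 GB, which would explain the crashes.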
My questions are: 1) How can I handle such a large matrix and get the pairwise (120,000*120,000) cosine similarity? 2) If that is impossible, how can I get the cosine similarity of just two documents at a time? Suppose I want the similarity between documents 10 and 21; then I would like something like
sim10_21<-cosine_similarity(tdm, d1=10,d2=21)
If tdm were an ordinary matrix, I could do the calculation on tdm[,c(10,21)]. However, converting tdm to a matrix is exactly what I cannot do. My question ultimately boils down to how to do matrix-like calculations on tdm.
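For concreteness, here is a minimal sketch of the kind of function I have in mind. `cosine_similarity` is a hypothetical helper, not an existing function. The sketch assumes that tdm behaves like a simple_triplet_matrix from the slam package (which tm builds on), so that column subsetting stays sparse. I have only tried the idea on a toy matrix, not on the full corpus.

```r
library(slam)

# Hypothetical helper: cosine similarity between two documents (columns)
# of a sparse term-document matrix, without converting it to a dense matrix.
cosine_similarity <- function(tdm, d1, d2) {
  # Extract only the two columns; the result is still a sparse triplet matrix
  sub <- tdm[, c(d1, d2)]
  # 2 x 2 dense matrix of dot products between the two documents
  cp <- crossprod_simple_triplet_matrix(sub)
  # Dot product divided by the product of the two vector norms
  cp[1, 2] / sqrt(cp[1, 1] * cp[2, 2])
}

# Toy example: 3 terms x 2 documents, doc 1 = (1, 1, 0), doc 2 = (1, 0, 1)
toy <- simple_triplet_matrix(i = c(1, 2, 1, 3), j = c(1, 1, 2, 2),
                             v = c(1, 1, 1, 1), nrow = 3, ncol = 2)
toy_sim <- cosine_similarity(toy, 1, 2)
toy_sim  # 0.5
```

With the real data the call would be `cosine_similarity(tdm, d1 = 10, d2 = 21)`, but I do not know whether this is the right or the most efficient way to work with tdm.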