
I want to calculate the cosine distance among authors of a corpus. Let's take a corpus of 20 documents.

require(tm)
data("crude")
length(crude)
# [1] 20

I want to find out the cosine distance (similarity) among these 20 documents. I create a term-document matrix with

tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

then I have to convert it to a matrix to pass it to dist() of the proxy package

tdm <- as.matrix(tdm)
require(proxy)
cosine_dist_mat <- as.matrix(dist(t(tdm), method = "cosine"))

Finally, I remove the diagonal of my cosine distance matrix (since I am not interested in the distance between a document and itself) and compute the average distance between each document and the other 19 documents of the corpus:

diag(cosine_dist_mat) <- NA
cosine_dist <- apply(cosine_dist_mat, 2, mean, na.rm=TRUE)

cosine_dist
# 127       144       191       194 
# 0.6728505 0.6788326 0.7808791 0.8003223 
# 211       236       237       242 
# 0.8218699 0.6702084 0.8752164 0.7553570 
# 246       248       273       349 
# 0.8205872 0.6495110 0.7064158 0.7494145 
# 352       353       368       489 
# 0.6972964 0.7134836 0.8352642 0.7214411 
# 502       543       704       708 
# 0.7294907 0.7170188 0.8522494 0.8726240

So far so good (with small corpora). The problem is that this method doesn't scale well to larger corpora. For one thing, it seems inefficient because of the two calls to as.matrix(): one to pass the tdm from tm to proxy, and one to calculate the average.

Is it possible to conceive a smarter way to obtain the same result?

CptNemo
  • `colMeans` is probably faster than `apply`. However, you should `Rprof` the call to see where it spends most of the time. It may well be the `dist` call, in which case there isn't much you can do. – James Apr 20 '15 at 14:49
  • You'd think that the 'tm' library would have this built-in... – wordsforthewise Nov 03 '17 at 23:01
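
The `colMeans` suggestion in the first comment can be checked on a toy matrix. This is only a sketch in base R: the matrix `m` and its values are made up for illustration and stand in for `cosine_dist_mat`.

```r
# Hypothetical symmetric 3 x 3 distance matrix (values are illustrative only)
m <- matrix(c(0.0, 0.2, 0.6,
              0.2, 0.0, 0.4,
              0.6, 0.4, 0.0),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Ignore each document's distance to itself
diag(m) <- NA

# Same column averages, computed two ways
by_apply    <- apply(m, 2, mean, na.rm = TRUE)
by_colMeans <- colMeans(m, na.rm = TRUE)

all.equal(by_apply, by_colMeans)
```

Both calls return the same named vector; `colMeans` simply avoids the per-column R function call that `apply` makes.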

2 Answers


Since tm's term document matrices are just sparse "simple triplet matrices" from the slam package, you could use the functions there to calculate the distances directly from the definition of cosine similarity:

library(slam)
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))

This takes advantage of sparse matrix multiplication. In my hands, a tdm with 2963 terms in 220 documents and 97% sparsity took barely a couple of seconds.

I haven't profiled this, so I have no idea if it's any faster than proxy::dist().

NOTE: for this to work, you should not coerce the tdm into a regular matrix, i.e., don't do tdm <- as.matrix(tdm).
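
As a rough end-to-end sketch of the same idea, the following builds a tiny sparse matrix by hand and then averages the distances per document. The toy counts and the `dense` helper matrix are made up for illustration; with a real corpus, `tdm` would be the uncoerced result of `TermDocumentMatrix()`. It assumes the `slam` package is installed.

```r
library(slam)

# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
# The counts are invented for illustration.
dense <- matrix(c(1, 0, 2, 0,
                  0, 3, 1, 0,
                  1, 1, 0, 4),
                nrow = 4,
                dimnames = list(paste0("t", 1:4), paste0("d", 1:3)))
tdm <- as.simple_triplet_matrix(dense)

# Euclidean norm of each document column, computed on the sparse object
norms <- sqrt(col_sums(tdm^2))

# Cosine similarity = dot products / product of norms; distance = 1 - similarity
cosine_sim_mat  <- crossprod_simple_triplet_matrix(tdm) / outer(norms, norms)
cosine_dist_mat <- 1 - cosine_sim_mat

# Average distance from each document to all the others
diag(cosine_dist_mat) <- NA
cosine_dist <- colMeans(cosine_dist_mat, na.rm = TRUE)
cosine_dist
```

Taking the square root of each norm once and using `outer(norms, norms)` is algebraically the same as `sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2)))`; it is just a slightly shorter way to write the denominator.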

NumerousHats
  • Indeed, using `slam` is the right way to proceed. It makes the execution time manageable also with big (10,000+ documents) corpora. – CptNemo Apr 21 '15 at 08:07

First, great code, MAndrecPhD! But I believe he meant to write:

cosine_dist_mat <- crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))

His code as written returns the dissimilarity (distance) score. For cosine similarity we want 1's on the diagonal, not 0's (see https://en.wikipedia.org/wiki/Cosine_similarity). I could be mistaken, and you actually want the dissimilarity score, but I thought I'd mention it, since it took me a little thought to sort through.
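
To make the difference between the two versions concrete, here is a small base R sketch. The three document vectors are made up: `d1` and `d2` point in the same direction, while `d3` is orthogonal to both.

```r
# Three hypothetical document vectors as columns (values are illustrative)
x <- cbind(d1 = c(1, 0, 2),
           d2 = c(2, 0, 4),
           d3 = c(0, 5, 0))

norms <- sqrt(colSums(x^2))

# Similarity: 1 on the diagonal, 1 for parallel vectors, 0 for orthogonal ones
cosine_sim <- crossprod(x) / outer(norms, norms)

# Distance (dissimilarity): 0 on the diagonal, i.e. 1 - similarity
cosine_dist <- 1 - cosine_sim

round(cosine_sim, 3)
round(cosine_dist, 3)
```

Which of the two matrices is the right one depends only on whether you want a similarity or a distance; one is obtained from the other by subtracting from 1.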

Luke Gallione