I want to calculate the cosine distance among authors of a corpus. Let's take a corpus of 20 documents.
require(tm)
data("crude")
length(crude)
# [1] 20
I want to find the cosine distance (that is, 1 minus the cosine similarity) between each pair of these 20 documents. I create a term-document matrix with
tdm <- TermDocumentMatrix(crude,
control = list(removePunctuation = TRUE,
stopwords = TRUE))
Then I have to convert it to a matrix to pass it to dist() of the proxy package:
tdm <- as.matrix(tdm)
require(proxy)
cosine_dist_mat <- as.matrix(dist(t(tdm), method = "cosine"))
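To make the quantity concrete, here is a minimal base-R sketch of what I understand the "cosine" distance to be for non-negative term counts: 1 minus the cosine of the angle between two document vectors. The function name cosine_distance and the toy vectors are made up for illustration, and I am assuming this matches what proxy computes for count data.

```r
# Hypothetical helper (not part of tm or proxy): cosine distance between
# two numeric vectors, defined as 1 - cosine similarity.
cosine_distance <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Toy term-count vectors for three made-up documents
d1 <- c(1, 0, 2)
d2 <- c(2, 0, 4)  # same direction as d1, only longer
d3 <- c(0, 3, 0)  # shares no terms with d1

cosine_distance(d1, d2)  # approximately 0 (up to floating-point error)
cosine_distance(d1, d3)  # 1, since the vectors are orthogonal
```

Document length does not matter here, only the relative term frequencies.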
Finally, I remove the diagonal of my cosine distance matrix (since I am not interested in the distance between a document and itself) and compute the average distance between each document and the other 19 documents of the corpus:
diag(cosine_dist_mat) <- NA
cosine_dist <- apply(cosine_dist_mat, 2, mean, na.rm=TRUE)
cosine_dist
# 127 144 191 194
# 0.6728505 0.6788326 0.7808791 0.8003223
# 211 236 237 242
# 0.8218699 0.6702084 0.8752164 0.7553570
# 246 248 273 349
# 0.8205872 0.6495110 0.7064158 0.7494145
# 352 353 368 489
# 0.6972964 0.7134836 0.8352642 0.7214411
# 502 543 704 708
# 0.7294907 0.7170188 0.8522494 0.8726240
So far so good (with small corpora). The problem is that this method doesn't scale well to larger corpora. For one thing, it seems inefficient because of the two calls to as.matrix(): one to pass the tdm from tm to proxy, and one to turn the resulting distance object into a matrix so that the average can be calculated.
Is it possible to conceive a smarter way to obtain the same result?
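One direction I have been considering, as a rough sketch rather than a tested solution: if every document vector is scaled to unit length, the sum of a document's cosine similarities to all the others equals its dot product with the sum of all unit vectors, minus 1 for itself. That would avoid both the dense term-document matrix and the full document-by-document distance matrix. The sketch below assumes the Matrix package is available and that every document has at least one term; mean_cosine_dist is a name I made up, and the toy triplets stand in for a real corpus.

```r
library(Matrix)

# Hypothetical helper: mean cosine distance from each column (document)
# to all other columns, given a sparse matrix in triplet form
# (row index i, column index j, value v).
# Assumes no document is empty (a zero norm would give a division by zero).
mean_cosine_dist <- function(i, j, v, nrow, ncol) {
  m <- sparseMatrix(i = i, j = j, x = v, dims = c(nrow, ncol))
  norms <- sqrt(colSums(m^2))             # Euclidean length of each document
  u <- m %*% Diagonal(x = 1 / norms)      # scale each column to unit length
  s <- rowSums(u)                         # sum of all unit document vectors
  sim_sum <- as.numeric(crossprod(u, s)) - 1  # drop self-similarity (= 1)
  1 - sim_sum / (ncol - 1)                # average over the other documents
}

# Toy data: 3 terms x 3 documents with columns
# (1, 0, 2), (2, 0, 4) and (0, 3, 0)
res <- mean_cosine_dist(i = c(1, 3, 1, 3, 2),
                        j = c(1, 1, 2, 2, 3),
                        v = c(1, 2, 2, 4, 3),
                        nrow = 3, ncol = 3)
res  # approximately 0.5 0.5 1.0
```

With tm, the triplets should be available directly from the TermDocumentMatrix, before it is overwritten by as.matrix(), as tdm$i, tdm$j, tdm$v, tdm$nrow and tdm$ncol. I have not checked how this behaves on a large corpus.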