I have two sparse matrices A and B (slam::simple_triplet_matrix) of the same MxN dimensions, where M = ~100K, N = ~150K.

I wish to calculate the cosine distance between each pair of rows (i.e. row 1 from matrix A and row 1 from matrix B, row 2 from matrix A and row 2 from matrix B, etc.).

I can do this with a for-loop or an apply function, but that is too slow. Something like:

library(slam)

A <- simple_triplet_matrix(1:3, 1:3, 1:3)
B <- simple_triplet_matrix(1:3, 3:1, 1:3)

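# Preallocate the result instead of growing it inside the loop
cosine <- numeric(nrow(A))
for (i in seq_len(nrow(A))) {
    a <- as.vector(A[i, ])
    b <- as.vector(B[i, ])
    # Cosine of row i: dot product over the product of the 2-norms
    cosine[i] <- a %*% b / sqrt(a %*% a * b %*% b)
}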

I understand something in this previously asked question might help me, but:

a) This isn't really what I want: I just want M cosine distances for M rows, not all pairwise correlations between the rows of a single sparse matrix.

b) I admit to not fully understanding the math behind this 'vectorized' solution, so some explanation would come in handy.

Thank you.

EDIT: This is also NOT a duplicate of this question. I'm not just interested in a regular cosine computation on two simple vectors (I clearly know how to do this, see above); I'm interested in a much larger-scale situation, specifically involving slam sparse matrices.

Giora Simchoni
  • Possible duplicate of [Find cosine similarity in R](http://stackoverflow.com/questions/2535234/find-cosine-similarity-in-r) – Dmitriy Selivanov Sep 06 '16 at 17:04
  • @DmitriySelivanov Hardly. There's no connection besides asking about cosine distance, and I'm not just asking about cosine distance (which I clearly know how to implement); I'm interested in a large-scale situation with sparse matrices. – Giora Simchoni Sep 06 '16 at 17:54

2 Answers

According to the slam documentation, both element-wise (array) multiplication of conformable simple_triplet_matrix objects and row_sums are available for this class. With these operators/functions, the computation is:

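cosineDist <- function(A, B){
  # Row-wise cosine: per-row dot product divided by the product of the row 2-norms.
  # This matches the loop in the question (a similarity; 1 minus it gives a distance).
  row_sums(A * B) / sqrt(row_sums(A * A) * row_sums(B * B))
}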

Notes:

  1. row_sums(A * B) computes the dot product of each row in A and its corresponding row in B, which is the numerator term in your cosine. The result is a vector (not sparse) whose elements are these dot products for each corresponding row in A and B.
  2. row_sums(A * A) computes the squared 2-norm of each row in A. The result is a vector (not sparse) whose elements are these squared 2-norms for each row in A.
  3. Similarly, row_sums(B * B) computes the squared 2-norm of each row in B. The result is a vector (not sparse) whose elements are these squared 2-norms for each row in B.
  4. The rest of the computation (the product, square root, and division) is element-wise on these length-M vectors, so the result holds one cosine value per row.
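
As a sanity check of the notes above, here is a minimal sketch that compares the vectorized function with an explicit per-row loop on the toy matrices from the question. The names `loopCosine` and `vectorized` are illustrative only and not part of slam; the dense conversion through `as.matrix` is used just for the small reference computation.

```r
library(slam)

# Toy inputs from the question: 3 x 3 sparse matrices.
A <- simple_triplet_matrix(1:3, 1:3, 1:3)
B <- simple_triplet_matrix(1:3, 3:1, 1:3)

# Vectorized row-wise cosine, as defined in the answer.
cosineDist <- function(A, B) {
  row_sums(A * B) / sqrt(row_sums(A * A) * row_sums(B * B))
}

# Reference: explicit per-row computation (hypothetical helper name),
# densifying one row at a time. Fine for a toy check, too slow at scale.
loopCosine <- vapply(seq_len(nrow(A)), function(i) {
  a <- as.vector(as.matrix(A[i, ]))
  b <- as.vector(as.matrix(B[i, ]))
  sum(a * b) / sqrt(sum(a * a) * sum(b * b))
}, numeric(1))

vectorized <- as.numeric(cosineDist(A, B))

# Both approaches should agree row by row.
stopifnot(isTRUE(all.equal(vectorized, loopCosine)))
print(vectorized)
```

On this toy data, rows 1 and 3 have their non-zero entries in different columns, so their cosine is 0, while row 2 is identical in both matrices, so its cosine is 1. A row that is entirely zero in either matrix would give 0/0, i.e. NaN, so it may be worth deciding how to handle such rows before running this on real data.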
aichao
0
cosineDist <- function(x){
  as.dist(1 - x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2))))) 
}
Sahil Desai
  • These are sparse matrices, I cannot just operate `as.matrix` on them. Also this goes beyond what is required, just M distances for M rows. Thanks. – Giora Simchoni Sep 06 '16 at 13:51