Cosine similarity calculation without nested loops

Question

I tried to build a similarity matrix using cosine similarity, and I used a nested loop. I know that nested loops are not always idiomatic in R, and this implementation takes a long time to run.

How can I rewrite this code without the nested loop?

cosine.sim <- function(data)
{
        data <- t(data)
        cos.sim <- matrix (data = 1, nrow = ncol(data), ncol = ncol(data))
        for(i in 1:(ncol(data)-1))
        {
                for(j in (i+1):ncol(data))
                {
                        A <- sqrt ( sum (data[,i] ^2) )
                        B <- sqrt ( sum (data[,j] ^2) )
                        C <- sum ( data[,i] * data[,j] ) 
                        cos.sim [i,j] <- C / (A * B)
                        cos.sim [j,i] <- C / (A * B)
                }
        }
        return (cos.sim)
}

Loops in R are fine, as long as you pre-allocate all the necessary objects. Your example is not reproducible. Paste some data and show us what the output would look like. It would also help if you describe in words or pseudo-algorithm what and how you're trying to achieve your goal. — Roman Luštrik, Feb 11 '16 at 14:51
Also, it is a good idea to avoid using functions as variable names, like `data`. — MikeJewski, Feb 11 '16 at 14:58
If you want to calculate the correlation coefficient, you don't need to write such a program; simply use an existing one, for example library(Hmisc) rcorr(x, type="pearson"), or check for other types. — , Feb 11 '16 at 15:00
@RomanLuštrik My data is a term-document matrix for 1500 documents. My function works properly, but it takes a long time to execute. It returns the document-to-document similarity of these documents. — Sahar, Feb 11 '16 at 15:06
I would move the A computation out of the inner loop (as it gives the same result for each j), and maybe test whether i and j are the same to avoid setting the diagonal twice (but I'm unsure this would be quicker). — Tensibai, Feb 11 '16 at 15:07
Did you try a package that already computes the cosine distance for a full matrix? Here is a post where you can find a discussion about it. — Carlos Alberto, Feb 11 '16 at 15:31

A. Webb · Accepted Answer · 2016-02-12T11:25:04.783

Using the low-level cross-product function tcrossprod, which runs in compiled code, should be orders of magnitude faster than doing the same computation with explicit loops in R.

Example data

> set.seed(1)
> (data<-matrix(runif(30),5,6))
          [,1]       [,2]      [,3]      [,4]      [,5]       [,6]
[1,] 0.2655087 0.89838968 0.2059746 0.4976992 0.9347052 0.38611409
[2,] 0.3721239 0.94467527 0.1765568 0.7176185 0.2121425 0.01339033
[3,] 0.5728534 0.66079779 0.6870228 0.9919061 0.6516738 0.38238796
[4,] 0.9082078 0.62911404 0.3841037 0.3800352 0.1255551 0.86969085
[5,] 0.2016819 0.06178627 0.7698414 0.7774452 0.2672207 0.34034900

The following is equivalent

> tcrossprod(data/sqrt(rowSums(data^2)))
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 0.8193235 0.8644710 0.6829105 0.5854560
[2,] 0.8193235 1.0000000 0.8523731 0.6810237 0.5835957
[3,] 0.8644710 0.8523731 1.0000000 0.7884536 0.8815997
[4,] 0.6829105 0.6810237 0.7884536 1.0000000 0.6324778
[5,] 0.5854560 0.5835957 0.8815997 0.6324778 1.0000000

but likely much faster than your function, which returns the same matrix:

> cosine.sim(data)
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 0.8193235 0.8644710 0.6829105 0.5854560
[2,] 0.8193235 1.0000000 0.8523731 0.6810237 0.5835957
[3,] 0.8644710 0.8523731 1.0000000 0.7884536 0.8815997
[4,] 0.6829105 0.6810237 0.7884536 1.0000000 0.6324778
[5,] 0.5854560 0.5835957 0.8815997 0.6324778 1.0000000
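
To spell out why the one-liner works, here is a minimal sketch rather than a definitive implementation. The names cosine_sim_vec and cosine_sim_loop are hypothetical, and the loop version is only a restatement of the question's approach for comparison. Dividing a matrix by a vector of row norms recycles the vector down the columns, so every row is scaled to unit length; tcrossprod then computes all pairwise dot products of those unit rows in one call.

```r
# Hypothetical helper names; x holds one observation (document) per row.
cosine_sim_vec <- function(x) {
  norms <- sqrt(rowSums(x^2))  # Euclidean length of each row
  # x / norms recycles norms down the columns: row i is divided by norms[i]
  tcrossprod(x / norms)        # same as (x / norms) %*% t(x / norms)
}

# Loop-based reference in the spirit of the question's function
cosine_sim_loop <- function(x) {
  n <- nrow(x)
  out <- diag(n)               # identity matrix: self-similarity is 1
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      out[i, j] <- out[j, i] <-
        sum(x[i, ] * x[j, ]) / sqrt(sum(x[i, ]^2) * sum(x[j, ]^2))
    }
  }
  out
}

set.seed(1)
x <- matrix(runif(200 * 50), 200, 50)  # made-up data, 200 rows

# Both versions should agree up to floating-point tolerance
all.equal(cosine_sim_vec(x), cosine_sim_loop(x))

# Rough timing comparison; actual numbers depend on the machine
system.time(cosine_sim_loop(x))
system.time(cosine_sim_vec(x))
```

One caveat with either version: a row of all zeros (for example, an empty document in a term-document matrix) has norm 0, so its similarities come out as NaN.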

@A.Webb Can you explain why you used tcrossprod(data/sqrt(rowSums(data^2))) instead of tcrossprod(data)/sqrt(rowSums(data^2))? — alily, Oct 28 '16 at 14:56
@alily To match the OP. I don't think the other option is equivalent. — A. Webb, Oct 28 '16 at 15:58
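
The difference between the two expressions in this exchange can be checked directly. The sketch below uses made-up data and hypothetical variable names. Normalizing before the cross product divides each dot product by both row norms, while dividing afterwards by a single vector of norms only divides entry [i, j] by the norm of row i. Dividing by the outer product of the norms would be an equivalent post-hoc alternative.

```r
set.seed(1)
x <- matrix(runif(30), 5, 6)
norms <- sqrt(rowSums(x^2))

# Normalize first: entry [i, j] is x_i . x_j / (|x_i| * |x_j|)
sim_ok <- tcrossprod(x / norms)

# Divide afterwards by the vector: entry [i, j] is x_i . x_j / |x_i| only
sim_wrong <- tcrossprod(x) / norms

# Divide afterwards by the outer product: both norms are applied
sim_outer <- tcrossprod(x) / outer(norms, norms)

isTRUE(all.equal(sim_ok, sim_outer))   # TRUE
isTRUE(all.equal(sim_ok, sim_wrong))   # FALSE
isSymmetric(sim_wrong)                 # FALSE
all.equal(diag(sim_wrong), norms)      # TRUE: diagonal is |x_i|, not 1
```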
