
I tried to compute a similarity matrix using cosine similarity, and I used a nested loop. I know that nested loops are not always idiomatic in R, and this implementation takes a long time to execute.

How can I rewrite this code without the nested loop?

cosine.sim <- function(data)
{
    data <- t(data)
    cos.sim <- matrix(data = 1, nrow = ncol(data), ncol = ncol(data))
    for (i in 1:(ncol(data) - 1))
    {
        for (j in (i + 1):ncol(data))
        {
            A <- sqrt(sum(data[, i]^2))
            B <- sqrt(sum(data[, j]^2))
            C <- sum(data[, i] * data[, j])
            cos.sim[i, j] <- C / (A * B)
            cos.sim[j, i] <- C / (A * B)
        }
    }
    return(cos.sim)
}
Sahar
  • Loops in R are fine, as long as you pre-allocate all the necessary objects. Your example is not reproducible. Paste some data and show us what the output would look like. It would also help if you described, in words or pseudocode, what you're trying to achieve and how. – Roman Luštrik Feb 11 '16 at 14:51
  • Also, it is a good idea to avoid using function names, like `data`, as variable names. – MikeJewski Feb 11 '16 at 14:58
  • If you want to calculate the correlation coefficient, you don't need to write such a program; simply use an existing one, for example `library(Hmisc)` and `rcorr(x, type="pearson")`, or check for other types. –  Feb 11 '16 at 15:00
  • @RomanLuštrik My data contains a term-document matrix for 1500 documents. My function works properly, but it takes a lot of time to execute. It returns the document/document similarity of these documents. – Sahar Feb 11 '16 at 15:06
  • I would move the A computation out of the nested loop (as it gives the same result for each j), and maybe test whether i and j are the same to avoid setting the diagonal twice (but I'm unsure this would be quicker). – Tensibai Feb 11 '16 at 15:07
  • Did you try a package that already does the cosine distance on a full matrix? Here is a post where you can find a discussion about it. – Carlos Alberto Feb 11 '16 at 15:31
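
Tensibai's comment about moving the norm computation out of the inner loop could be sketched roughly as follows. This is only an illustration, not code from the thread: the name `cosine_sim_loop` is hypothetical, and it assumes a numeric matrix with one document per row and no all-zero rows.

```r
# Hypothetical helper: nested loop with the row norms computed once up front.
# Assumes x is a numeric matrix with one document per row and no all-zero rows.
cosine_sim_loop <- function(x) {
  n <- nrow(x)
  norms <- sqrt(rowSums(x^2))            # one norm per row, computed once
  sim <- matrix(1, nrow = n, ncol = n)   # pre-allocated; diagonal stays 1
  for (i in seq_len(max(n - 1, 0))) {
    for (j in (i + 1):n) {
      value <- sum(x[i, ] * x[j, ]) / (norms[i] * norms[j])
      sim[i, j] <- value
      sim[j, i] <- value                 # the matrix is symmetric
    }
  }
  sim
}

# Small made-up input, only to show the call.
set.seed(1)
x <- matrix(runif(30), nrow = 5, ncol = 6)
sim <- cosine_sim_loop(x)
print(dim(sim))
```

This removes the repeated norm computation, but it still performs one R-level iteration per pair of documents.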

1 Answer


Using the low-level cross-product function `tcrossprod`, which runs in compiled code, should be orders of magnitude faster than doing the same computation with loops in R.

Example data

> set.seed(1)
> (data<-matrix(runif(30),5,6))
          [,1]       [,2]      [,3]      [,4]      [,5]       [,6]
[1,] 0.2655087 0.89838968 0.2059746 0.4976992 0.9347052 0.38611409
[2,] 0.3721239 0.94467527 0.1765568 0.7176185 0.2121425 0.01339033
[3,] 0.5728534 0.66079779 0.6870228 0.9919061 0.6516738 0.38238796
[4,] 0.9082078 0.62911404 0.3841037 0.3800352 0.1255551 0.86969085
[5,] 0.2016819 0.06178627 0.7698414 0.7774452 0.2672207 0.34034900

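The following is equivalent (each row is divided by its Euclidean norm, so the cross product of the rows gives their cosine similarity)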

> tcrossprod(data/sqrt(rowSums(data^2)))
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 0.8193235 0.8644710 0.6829105 0.5854560
[2,] 0.8193235 1.0000000 0.8523731 0.6810237 0.5835957
[3,] 0.8644710 0.8523731 1.0000000 0.7884536 0.8815997
[4,] 0.6829105 0.6810237 0.7884536 1.0000000 0.6324778
[5,] 0.5854560 0.5835957 0.8815997 0.6324778 1.0000000

but it is likely much faster than your function, which returns the same matrix

> cosine.sim(data)
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 0.8193235 0.8644710 0.6829105 0.5854560
[2,] 0.8193235 1.0000000 0.8523731 0.6810237 0.5835957
[3,] 0.8644710 0.8523731 1.0000000 0.7884536 0.8815997
[4,] 0.6829105 0.6810237 0.7884536 1.0000000 0.6324778
[5,] 0.5854560 0.5835957 0.8815997 0.6324778 1.0000000
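
To check the equivalence and get a rough idea of the speed difference on a larger input, something like the sketch below could be used. The wrapper name `cosine.sim.fast` and the 200 x 50 random matrix are assumptions made for illustration. Timings depend on the machine, so none are shown. Note that a row of all zeros would produce `NaN` in either version.

```r
# Nested-loop version from the question (one document per row).
cosine.sim <- function(data) {
  data <- t(data)
  cos.sim <- matrix(data = 1, nrow = ncol(data), ncol = ncol(data))
  for (i in 1:(ncol(data) - 1)) {
    for (j in (i + 1):ncol(data)) {
      A <- sqrt(sum(data[, i]^2))
      B <- sqrt(sum(data[, j]^2))
      C <- sum(data[, i] * data[, j])
      cos.sim[i, j] <- C / (A * B)
      cos.sim[j, i] <- C / (A * B)
    }
  }
  cos.sim
}

# Hypothetical wrapper around the one-liner: scale each row to unit length,
# then take the cross product of the rows.
cosine.sim.fast <- function(data) tcrossprod(data / sqrt(rowSums(data^2)))

# Made-up test matrix: 200 "documents" by 50 "terms" of random values.
set.seed(1)
m <- matrix(runif(200 * 50), nrow = 200, ncol = 50)

slow <- cosine.sim(m)
fast <- cosine.sim.fast(m)

# Compare the two results up to floating-point tolerance.
print(all.equal(slow, fast))

# Rough timings of each version on the same input.
print(system.time(cosine.sim(m)))
print(system.time(cosine.sim.fast(m)))
```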
A. Webb