4

Possible Duplicate:
Find cosine similarity in R

I have a large table similar to this one in R. I am wanting to find the cosine similarity between each of the items, e.g. the pairs (91, 93), (91, 99), (91, 100) … (101, 125). The final output should be

No_1 No_2 Similarity
...
6518 6763 0.974
…

The table looks like this.

      No_ Product.Group.Code   R1   R2   R3   R4   S1   S2   S3   U1   U2   U3 U4 U6
91  65418                164 0.68 0.70 0.50 0.59   NA   NA 0.96   NA 0.68   NA NA NA
93  57142                164   NA 0.94   NA   NA 0.83   NA   NA 0.54   NA   NA NA NA
99  66740                164 0.68 0.68 0.74   NA 0.63 0.68 0.72   NA   NA   NA NA NA
100 76712                164 0.54 0.54 0.40   NA 0.39 0.39 0.39 0.50   NA 0.50 NA NA
101 56463                164 0.67 0.67 0.76   NA   NA 0.76 0.76 0.54   NA   NA NA NA
125 11713                164   NA   NA   NA   NA   NA 0.88   NA   NA   NA   NA NA NA

Because some of the rows have NA, I wrote some helper functions to only compare columns where both of the rows are not NA.

compareNA <- function(v1,v2) {
    same <- (!is.na(v1) & !is.na(v2))
    same[is.na(same)] <- FALSE
    return(same)
}

selectTRUE <- function(v1, truth) {
    # This function selects only the variables which correspond to the truth vector
    # being true.
    for (colname in colnames(v1)) {
        if( !truth[ ,colname] ) {
            v1[colname] <- NULL
        }
    }
    return(v1)
}

trimAndTuck <- function(v1){
    # Turns list into vector and removes first two columns
    return (unlist(v1, use.names = FALSE)[-(1:2)])
}

cosineSimilarity <- function(v1, v2) {
    truth <- compareNA(v1, v2)
    return (cosine(
                 trimAndTuck(selectTRUE(v1, truth)),
                 trimAndTuck(selectTRUE(v2, truth))
                 ))
}

allPairs <- function(df){
    for ( i in 1:length(df)) {
        for (j in 1:length(df)) {
            print( cosineSimilarity(df[i,], df[j,]) )
        }
    }
}

Running allpairs does give me the correct answer but it does so in a series of 1x1 vectors. I am well aware that what I have written is probably an insult to the functional gods but I wasn't sure how else to write it.

How could this be rewritten (vectorised?) so that it returns data in the right format?

EDIT: I am using the cosine function that is part of the LSA package. This is about handling NA values with the cosine function, not how to calculate standard cosine similarities.

Community
  • 1
  • 1
Daniel Compton
  • 13,878
  • 4
  • 40
  • 60
  • 1
    Is the function `cosine` part of an R package? – Sven Hohenstein Nov 21 '12 at 06:26
  • 1
    Maybe this [so-question](http://stackoverflow.com/questions/2535234/find-cosine-similarity-in-r) (possible duplicated) would help. Follow the instructions of the accepted answer. – sgibb Nov 21 '12 at 07:39
  • This is not a duplicate as it is about how to handle `NA` values when using the `cosine` function. – Roland Nov 21 '12 at 12:10

1 Answers1

4
#data
df <- read.table(text="No_ Product.Group.Code   R1   R2   R3   R4   S1   S2   S3   U1   U2   U3 U4 U6
91  65418                164 0.68 0.70 0.50 0.59   NA   NA 0.96   NA 0.68   NA NA NA
93  57142                164   NA 0.94   NA   NA 0.83   NA   NA 0.54   NA   NA NA NA
99  66740                164 0.68 0.68 0.74   NA 0.63 0.68 0.72   NA   NA   NA NA NA
100 76712                164 0.54 0.54 0.40   NA 0.39 0.39 0.39 0.50   NA 0.50 NA NA
101 56463                164 0.67 0.67 0.76   NA   NA 0.76 0.76 0.54   NA   NA NA NA
125 11713                164   NA   NA   NA   NA   NA 0.88   NA   NA   NA   NA NA NA",header=TRUE)

#remove second column
df <- df[,-2]

#transform to long format
library(reshape2)
df <- melt(df,id.vars="No_")

#cosine similarity taken from package lsa
#I could not load package lsa, because I lack Java on my system
cosine <- function( x, y=NULL ) {

  if ( is.matrix(x) && is.null(y) ) {

    co = array(0,c(ncol(x),ncol(x)))
    f = colnames( x )
    dimnames(co) = list(f,f)

    for (i in 2:ncol(x)) {
      for (j in 1:(i-1)) {
        co[i,j] = cosine(x[,i], x[,j])
      }
    }
    co = co + t(co)
    diag(co) = 1

    return (as.matrix(co))

  } else if ( is.vector(x) && is.vector(y) ) {
    return ( crossprod(x,y) / sqrt( crossprod(x)*crossprod(y) ) )
  } else {
    stop("argument mismatch. Either one matrix or two vectors needed as input.")
  }

}

#define function that removes NA before calculating the similarity
cosine2 <- function(x,y) cosine(na.omit(cbind(x,y)))

#pairwise comparisons
i <- outer(unique(df$No_),unique(df$No_),FUN=function(i,j) i)
j <- outer(unique(df$No_),unique(df$No_),FUN=function(i,j) j)

i <- i[!lower.tri(i)]
j <- j[!lower.tri(j)]

comp <- function(ind) {
  res <- cosine2(df$value[df$No_==i[ind]],df$value[df$No_==j[ind]])[1,2]
  list(No1=as.character(i[ind]),No2=as.character(j[ind]),CosSim=res)
}

res <- as.data.frame(t(sapply(seq_along(i),FUN="comp")))

     No1   No2    CosSim
1  65418 65418         1
2  65418 57142         1
3  57142 57142         1
4  65418 66740 0.9724159
5  57142 66740  0.999714
6  66740 66740         1
7  65418 76712 0.9569313
8  57142 76712 0.9684678
9  66740 76712 0.9854669
10 76712 76712         1
11 65418 56463 0.9741412
12 57142 56463 0.9877108
13 66740 56463 0.9989167
14 76712 56463 0.9708716
15 56463 56463         1
16 65418 11713       NaN
17 57142 11713       NaN
18 66740 11713         1
19 76712 11713         1
20 56463 11713         1
21 11713 11713         1
Roland
  • 127,288
  • 10
  • 191
  • 288