Possible Duplicate:
Find cosine similarity in R
I have a large table in R similar to the one below. I want to find the cosine similarity between each pair of items, e.g. the pairs (91, 93), (91, 99), (91, 100) … (101, 125). The final output should be
No_1 No_2 Similarity
...
6518 6763 0.974
…
The table looks like this.
No_ Product.Group.Code R1 R2 R3 R4 S1 S2 S3 U1 U2 U3 U4 U6
91 65418 164 0.68 0.70 0.50 0.59 NA NA 0.96 NA 0.68 NA NA NA
93 57142 164 NA 0.94 NA NA 0.83 NA NA 0.54 NA NA NA NA
99 66740 164 0.68 0.68 0.74 NA 0.63 0.68 0.72 NA NA NA NA NA
100 76712 164 0.54 0.54 0.40 NA 0.39 0.39 0.39 0.50 NA 0.50 NA NA
101 56463 164 0.67 0.67 0.76 NA NA 0.76 0.76 0.54 NA NA NA NA
125 11713 164 NA NA NA NA NA 0.88 NA NA NA NA NA NA
Because some of the rows have NA, I wrote some helper functions to only compare columns where both of the rows are not NA.
library(lsa)  # provides cosine()

compareNA <- function(v1, v2) {
  # TRUE for every column where neither row is NA
  same <- (!is.na(v1) & !is.na(v2))
  same[is.na(same)] <- FALSE
  return(same)
}

selectTRUE <- function(v1, truth) {
  # This function selects only the variables which correspond to the truth vector
  # being true.
  for (colname in colnames(v1)) {
    if (!truth[, colname]) {
      v1[colname] <- NULL
    }
  }
  return(v1)
}

trimAndTuck <- function(v1) {
  # Turns list into vector and removes first two columns
  return(unlist(v1, use.names = FALSE)[-(1:2)])
}

cosineSimilarity <- function(v1, v2) {
  truth <- compareNA(v1, v2)
  return(cosine(
    trimAndTuck(selectTRUE(v1, truth)),
    trimAndTuck(selectTRUE(v2, truth))
  ))
}
allPairs <- function(df) {
  # Loop over rows: nrow(df) counts rows, whereas length(df) counts columns
  for (i in 1:nrow(df)) {
    for (j in 1:nrow(df)) {
      print(cosineSimilarity(df[i, ], df[j, ]))
    }
  }
}
Running allPairs does give me the correct answer, but it does so in a series of 1x1 vectors. I am well aware that what I have written is probably an insult to the functional gods, but I wasn't sure how else to write it.
How could this be rewritten (vectorised?) so that it returns data in the right format?
EDIT: I am using the cosine function that is part of the lsa package. This question is about handling NA values with the cosine function, not about how to calculate standard cosine similarities.
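For illustration, here is a minimal sketch of one way the desired No_1 / No_2 / Similarity output could be produced. It is not a definitive answer: it swaps lsa's cosine for plain base R arithmetic so that it runs without extra packages, it assumes the identifier to report is the No_ column, and the names cosine_complete, all_pairs, id_col and skip are hypothetical helpers introduced only for this example. The sample data is the table from the question.

```r
# Sample data copied from the question. The header has one field fewer than
# the data lines, so read.table uses the first field (91, 93, ...) as row names.
txt <- "No_ Product.Group.Code R1 R2 R3 R4 S1 S2 S3 U1 U2 U3 U4 U6
91 65418 164 0.68 0.70 0.50 0.59 NA NA 0.96 NA 0.68 NA NA NA
93 57142 164 NA 0.94 NA NA 0.83 NA NA 0.54 NA NA NA NA
99 66740 164 0.68 0.68 0.74 NA 0.63 0.68 0.72 NA NA NA NA NA
100 76712 164 0.54 0.54 0.40 NA 0.39 0.39 0.39 0.50 NA 0.50 NA NA
101 56463 164 0.67 0.67 0.76 NA NA 0.76 0.76 0.54 NA NA NA NA
125 11713 164 NA NA NA NA NA 0.88 NA NA NA NA NA NA"
df <- read.table(text = txt, header = TRUE)

# Hypothetical helper: cosine similarity over the positions where both
# vectors are non-NA. Returns NA when the two vectors share no position.
cosine_complete <- function(x, y) {
  keep <- !is.na(x) & !is.na(y)
  if (!any(keep)) return(NA_real_)
  x <- x[keep]
  y <- y[keep]
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

# Hypothetical helper: one output row per unordered pair of input rows.
# Assumes the first `skip` columns are identifiers, not measurements.
all_pairs <- function(df, id_col = "No_", skip = 2) {
  m <- as.matrix(df[, -seq_len(skip)])   # numeric matrix of measurements
  idx <- combn(nrow(m), 2)               # each column is one (i, j) pair
  data.frame(
    No_1 = df[[id_col]][idx[1, ]],
    No_2 = df[[id_col]][idx[2, ]],
    Similarity = mapply(
      function(i, j) cosine_complete(m[i, ], m[j, ]),
      idx[1, ], idx[2, ]
    )
  )
}

result <- all_pairs(df)
print(result)
```

Using combn visits each unordered pair once, so a pair is not reported twice and no row is compared with itself. Note that under this sketch a pair sharing only a single non-NA column gets a similarity of 1 regardless of the values, and a pair sharing no columns gets NA; whether that is acceptable depends on the data.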