I am writing a function to retrieve the top n results from a list of words and their values using cosine similarity. My data is shown below; these are the first few entries of ~400k, but they give an idea of the structure.

the 0.41800  0.249680 -0.41242  0.121700 0.345270 -0.044457 -0.49688 -0.178620 -0.00066023 -0.656600 0.278430 -0.14767 -0.55677  0.14658 -0.0095095
.   0.15164  0.301770 -0.16763  0.176840 0.317190  0.339730 -0.43478 -0.310860 -0.44999000 -0.294860 0.166080  0.11963 -0.41328 -0.42353  0.5986800
of  0.70853  0.570880 -0.47160  0.180480 0.544490  0.726030  0.18157 -0.523930  0.10381000 -0.175660 0.078852 -0.36216 -0.11829 -0.83336  0.1191700
to  0.68047 -0.039263  0.30186 -0.177920 0.429620  0.032246 -0.41376  0.132280 -0.29847000 -0.085253 0.171180  0.22419 -0.10046 -0.43653  0.3341800
and 0.26818  0.143460 -0.27877  0.016257 0.113840  0.699230 -0.51332 -0.473680 -0.33075000 -0.138340 0.270200  0.30938 -0.45012 -0.41270 -0.0993200
in  0.33042  0.249950 -0.60874  0.109230 0.036372  0.151000 -0.55083 -0.074239 -0.09230700 -0.328210 0.095980 -0.82269 -0.36717 -0.67009  0.4290900

Here's the code for my cosine similarity:

cosineSim <- function(v1, v2) {
  a <- sum(v1 * v2)
  b <- sqrt(sum(v1 * v1)) * sqrt(sum(v2 * v2))
  return(a / b)
}

I need to take the user vector and compare it to every other vector in the table x, which contains the data set. For example, x['cat',] returns the 50 dimensional vector with all of the values for the word 'cat'.

Here's a sample of what my cosineSim function returns:

cosineSim(x['cat',],x['dog',])

prints the following:

[1] 0.9218005

This represents the cosine similarity of those words.

The values are decimals, and this is the first project I've worked on using R, so I haven't been able to adapt the code here to my needs.

Any help would be greatly appreciated.

CS2016
  • That was just an example; I could have chosen x['the',] and it would have given you the same result as the first line in the structure of the data. I'll update the question to show the result of cosineSim. – CS2016 Apr 18 '16 at 01:37

1 Answer

Are you looking for something like this? It calculates the cosine similarity of the user vector against every row and then shows the top 10. Do you need an abs() around the cosineSim value?

library(data.table)
library(stringi)

#generate sample data
set.seed(1)
numRows <- 400
numCols <- 50
rnames <- head(unique(stri_rand_strings(numRows*2, sample(5:11, 5, replace=TRUE), '[a-zA-Z0-9]')),numRows)
dt <- data.table(matrix(rnorm(numRows*numCols), ncol=numCols, dimnames=list(rnames,NULL)),
    keep.rownames=TRUE)
userVec <- rnorm(numCols)

#userVec norm and cosineSim calculation function
userVecNorm <- sqrt(sum(userVec^2))
cosSim <- function(x) sum(userVec*x) / userVecNorm / sqrt(sum(x^2))

#calculate pairwise cosineSim with userVec
dt[, cosineSim:=apply(.SD, 1, cosSim), .SDcols=V1:V50]

#order by cosineSim, highest first, and show top 10
dt[order(-cosineSim)][1:10]
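If the vectors are already in a numeric matrix with the words as row names, the same ranking can probably be done without data.table by using one matrix multiplication. The following is only a sketch under that assumption: `topNCosine` and the tiny matrix `m` are made-up names for illustration, not something from the question. If `x` is a data frame, it would need `as.matrix(x)` first, and a single row such as `x['cat',]` would need `unlist()` to become a plain numeric vector.

```r
# Hypothetical helper (not from the question): returns the n rows of the
# numeric matrix `m` that are most similar to the numeric vector `v`.
topNCosine <- function(m, v, n = 10) {
  # Dot product of every row of m with v, in a single matrix multiplication
  dots <- as.vector(m %*% v)
  # Euclidean norm of each row, and of v
  rowNorms <- sqrt(rowSums(m^2))
  vNorm <- sqrt(sum(v^2))
  sims <- dots / (rowNorms * vNorm)
  names(sims) <- rownames(m)
  # Sort with the largest similarity first and keep the first n
  head(sort(sims, decreasing = TRUE), n)
}

# Made-up 2-dimensional example data
m <- rbind(a = c(1, 0), b = c(0, 1), c = c(1, 1), d = c(-1, 0))
res <- topNCosine(m, c(1, 0), n = 2)
print(res)
```

In this toy example the two rows returned are `a` (same direction as the query) and `c` (45 degrees away). Note that a row containing only zeros would give `NaN`, so real data may need a check for that. If the sign of the similarity should be ignored, wrapping `sims` in `abs()` before sorting would be one option.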
chinsoon12