I am computing the cosine similarity between two documents. I did it like this:

D1 = (8, 0, 0, 1), where 8, 0, 0, 1 are the tf-idf scores of the terms t1, t2, t3, t4

D2 = (7, 0, 0, 1)

cos(theta) = (56 + 0 + 0 + 1) / sqrt(64 + 49) sqrt(1 +1 )

which comes out to be

cos(theta)= 5

Now, what do I conclude from this value? I don't understand what cos(theta) = 5 signifies about the similarity between the two documents. Am I doing this right?

Shrayas
jaskirat
    cos(theta) is always between -1 and 1. You are doing something wrong. Also, is this homework? –  May 18 '10 at 18:36

1 Answer

The denominator is wrong.

The cosine similarity is defined as

         D1 · D2
 sim = ———————————
        |D1| |D2|

Here

D1 · D2 = 8*7 + 0*0 + 0*0 + 1*1 = 57
   |D1| = √(8^2 + 0^2 + 0^2 + 1^2) = √65
   |D2| = √(7^2 + 0^2 + 0^2 + 1^2) = √50

So the similarity should be (57 / √(50 * 65)) = 0.999846142, not 5.
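
If you want to cross-check the arithmetic, here is a minimal Python sketch of the same formula. The function name `cosine_similarity` and the zero-vector check are my own choices for illustration, not part of any particular library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Numerator: dot product, summed term by term.
    dot = sum(x * y for x, y in zip(a, b))
    # Denominator: each norm uses the components of ONE vector only.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

d1 = (8, 0, 0, 1)  # tf-idf scores of t1..t4 in D1
d2 = (7, 0, 0, 1)  # tf-idf scores of t1..t4 in D2
similarity = cosine_similarity(d1, d2)
print(round(similarity, 6))  # 0.999846
```

A value this close to 1 means the two tf-idf vectors point in almost the same direction, so the documents are nearly identical in their term weighting.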

kennytm
  • Oh, I neglected the zero values. How stupid of me. Thanks, KennyTM, thank you so much. – jaskirat May 18 '10 at 18:40
  • @jaskirat: You did not neglect the zero values. You computed |D1| and |D2| incorrectly. There is no such thing as √(7^2 + 8^2). – kennytm May 18 '10 at 18:43
  • OK. I was using http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html as a reference. – jaskirat May 18 '10 at 18:51
  • I am checking your answer, but I still cannot get the same result you showed (0.999846142). – jaskirat May 18 '10 at 18:57
  • Got it. I was just cross-checking the results. Thanks, Kenny. – jaskirat May 18 '10 at 19:01