
I am doing a little research on text mining and data mining, and I need more help understanding cosine similarity. I have read about it and noticed that all of the examples on the internet apply tf-idf before computing cosine similarity.

My questions

1. Is it possible to calculate cosine similarity just by using the highest-frequency terms from a text file, which will be the dataset? Most of the videos and tutorials that I have gone through run tf-idf before feeding its output into cosine similarity. If not, what other kinds of weighting can be fed into cosine similarity?

2. Why is normalization used with tf-idf to compute cosine similarity? (Can I do it without normalization?) Cosine similarity is computed from the normalized tf-idf output. Why is normalization needed?

3. What does cosine similarity actually do to the tf-idf weights?

user3809384

1 Answer


I do not understand question 1.

  1. TF-IDF weighting is a weighting scheme that has worked well for lots of people on real data (think Lucene search). But its theoretical foundations are a bit weak. In particular, everybody seems to use a slightly different version of it... and yes, it is weights + cosine similarity. In practice, you may want to try e.g. Okapi BM25 weighting instead, though.
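To make the "weights + cosine similarity" pipeline concrete, here is a minimal pure-Python sketch on a made-up three-document corpus, using one common idf variant, idf(t) = log(N / df(t)) (as noted above, everybody uses a slightly different version):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Per-document term frequencies
tfs = [Counter(doc.split()) for doc in docs]

# Document frequency and idf(t) = log(N / df(t)) -- one of many variants
N = len(docs)
df = Counter(term for tf in tfs for term in tf)
idf = {term: math.log(N / df[term]) for term in df}

# tf-idf weight vectors, stored as sparse dicts
vecs = [{term: count * idf[term] for term, count in tf.items()} for tf in tfs]

def cosine(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

print(cosine(vecs[0], vecs[1]))  # > 0: the documents share "the", "sat", "on"
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms at all
```

The sparse-dict representation is the point of the sparsity remark below: only terms that actually occur in both documents contribute to the dot product.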

  2. I do not understand this question either. Angular similarity is beneficial because the length of the text has less influence than with other distances. Furthermore, sparsity can be nicely exploited. As for the weights, IDF is a heuristic with only loose statistical arguments: frequent words are more likely to occur at random, and thus should have less weight.
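The length-invariance point can be seen directly: dividing each vector by its L2 norm (the ||x||2 from the question) turns cosine similarity into a plain dot product, and repeating a document, which doubles every count, does not change the result. A small sketch with toy vectors:

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    # After L2 normalization, cosine similarity is just the dot product
    un, vn = l2_normalize(u), l2_normalize(v)
    return sum(x * y for x, y in zip(un, vn))

doc = [3.0, 0.0, 4.0]        # toy term-count vector
doubled = [6.0, 0.0, 8.0]    # the same document concatenated with itself

print(cosine(doc, doubled))  # ~1.0: same direction, length is ignored
```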

Maybe you can try to rephrase your questions so I can fully understand them. Also search for related questions such as these: Cosine similarity and tf-idf and Better text documents clustering than tf/idf and cosine similarity?

Has QUIT--Anony-Mousse
  • For question 1) I have watched tutorials and read some documents on cosine similarity, and all of them pass tf-idf's output values for a set of data into the cosine-similarity equation. What I wanted to ask here is: is it possible to do cosine similarity with a frequency distribution instead of tf-idf's output? 2) Why does normalization, i.e. ||x||2, need to be added to the tf-idf equation when calculating cosine similarity? I do understand that tf-idf is for weights – user3809384 Sep 01 '14 at 22:16
  • sorry for the poor question. already re-edited the questions. pretty new in this topic. – user3809384 Sep 01 '14 at 22:30
  • tf is term frequency... yes, you can use tf only, but tf-idf works better because it puts less weight on frequent words. Normalization - look up the definition of the cosine of an angle... that is math. – Has QUIT--Anony-Mousse Sep 01 '14 at 22:33
  • This will be a little off topic: I have seen your profile and you have answered some questions on k-means. Can k-means be used on strings? For instance, to cluster keywords into categories from a database of texts. From my understanding it's only possible to use numbers in k-means, and by sorting 4 keywords the centroids will be divided equally on 4 edges, returning an inaccurate result. – user3809384 Sep 01 '14 at 22:41
  • K-means can only be used on float vectors. Because it must compute *means*. You can use spherical k-means on documents, but not on keywords. And IMHO, it doesn't really work well. – Has QUIT--Anony-Mousse Sep 01 '14 at 22:57
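Tying the comment thread together: cosine similarity accepts any non-negative term vectors, so raw frequency distributions work; tf-idf merely reweights the coordinates first. A toy sketch (made-up two-document example) of why the reweighting matters, showing how a frequent shared stop word dominates raw-count cosine:

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Raw term-frequency vectors: the stop word "the" dominates
tf1 = Counter("the the the cat sat".split())
tf2 = Counter("the the the dog ran".split())
print(cosine(tf1, tf2))  # high, although no content words are shared

# Zeroing out "the" -- exactly what idf(t) = log(N / df(t)) does to a
# word that occurs in every document -- removes the spurious similarity
w1 = {t: c for t, c in tf1.items() if t != "the"}
w2 = {t: c for t, c in tf2.items() if t != "the"}
print(cosine(w1, w2))  # -> 0.0
```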