A common way of calculating the cosine similarity between text documents is to compute the TF-IDF matrix and then take the linear kernel of that matrix.
The TF-IDF matrix is calculated using TfidfVectorizer():
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix_content = tfidf.fit_transform(article_master['stemmed_content'])
Here article_master is a dataframe containing the text content of all the documents.
As explained by Chris Clark here, TfidfVectorizer produces L2-normalised vectors by default, so the linear_kernel results can be used directly as cosine similarities.
from sklearn.metrics.pairwise import linear_kernel

cosine_sim_content = linear_kernel(tfidf_matrix_content, tfidf_matrix_content)
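For what it's worth, a quick sanity check on toy documents (the documents below are made up and stand in for article_master['stemmed_content']) does show that the two results agree:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
import numpy as np

# Toy documents standing in for the stemmed content
docs = ["the cat sat on the mat", "the dog ate my homework"]
tfidf = TfidfVectorizer(stop_words='english')
m = tfidf.fit_transform(docs)

# TfidfVectorizer uses norm='l2' by default, so every row has unit length
# and the plain inner products already equal the cosine similarities.
print(np.allclose(linear_kernel(m, m), cosine_similarity(m, m)))  # True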
This is where my confusion lies.
Effectively, the cosine similarity between two vectors is:
InnerProduct(vec1,vec2) / (VectorSize(vec1) * VectorSize(vec2))
The linear kernel calculates just the InnerProduct, as stated here.
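To make my mental model concrete, here is that formula spelled out with NumPy on two made-up vectors (the vectors are just examples, not actual TF-IDF rows):

import numpy as np

vec1 = np.array([1.0, 2.0, 0.0])   # made-up example vectors
vec2 = np.array([2.0, 1.0, 1.0])

# Cosine similarity exactly as written above:
# inner product over the product of the magnitudes
cosine = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# If the vectors are first scaled to unit length, the denominator becomes 1
# and the inner product alone gives the same number.
u1 = vec1 / np.linalg.norm(vec1)
u2 = vec2 / np.linalg.norm(vec2)
print(np.isclose(np.dot(u1, u2), cosine))  # True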
So the questions are:
Why am I not dividing the inner product by the product of the magnitudes of the vectors?
Why does the normalisation exempt me from this requirement?
Now, if I wanted to calculate TS-SS similarity, could I still use the normalised TF-IDF matrix and the cosine values (calculated by the linear kernel only)?