A common way of calculating the cosine similarity between text documents is to compute the TF-IDF matrix and then take the linear kernel of that matrix.
The TF-IDF matrix is calculated using TfidfVectorizer():
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix_content = tfidf.fit_transform(article_master['stemmed_content'])
Here article_master is a dataframe containing the text content of all the documents.
As explained by Chris Clark here, TfidfVectorizer produces L2-normalised vectors by default, so the linear_kernel results can be used directly as cosine similarities.
from sklearn.metrics.pairwise import linear_kernel

cosine_sim_content = linear_kernel(tfidf_matrix_content, tfidf_matrix_content)
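For what it's worth, a quick sanity check on toy documents (the documents below are made up and stand in for article_master['stemmed_content']) does show that the two results agree:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
import numpy as np

# Toy documents standing in for the stemmed content
docs = ["the cat sat on the mat", "the dog ate my homework"]
tfidf = TfidfVectorizer(stop_words='english')
m = tfidf.fit_transform(docs)

# TfidfVectorizer uses norm='l2' by default, so every row has unit length
# and the plain inner products already equal the cosine similarities.
print(np.allclose(linear_kernel(m, m), cosine_similarity(m, m)))  # True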
This is where my confusion lies.
Effectively, the cosine similarity between two vectors is:
InnerProduct(vec1,vec2) / (VectorSize(vec1) * VectorSize(vec2))
The linear kernel calculates just the InnerProduct, as stated here.
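To make my mental model concrete, here is that formula spelled out with NumPy on two made-up vectors (the vectors are just examples, not actual TF-IDF rows):

import numpy as np

vec1 = np.array([1.0, 2.0, 0.0])   # made-up example vectors
vec2 = np.array([2.0, 1.0, 1.0])

# Cosine similarity exactly as written above:
# inner product over the product of the magnitudes
cosine = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# If the vectors are first scaled to unit length, the denominator becomes 1
# and the inner product alone gives the same number.
u1 = vec1 / np.linalg.norm(vec1)
u2 = vec2 / np.linalg.norm(vec2)
print(np.isclose(np.dot(u1, u2), cosine))  # True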
So the questions are:
Why am I not dividing the inner product by the product of the magnitudes of the vectors?
Why does the normalisation exempt me from this requirement?
Now, if I wanted to calculate TS-SS similarity, could I still use the normalised TF-IDF matrix and the cosine values (calculated by the linear kernel only)?