I followed Fred Foo's answer to the Stack Overflow question "How to compute the similarity between two text documents?"
I ran the following code from that answer:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

# Build TF-IDF vectors; rows are L2-normalized by default
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)

# With unit-length rows, the product with the transpose gives cosine similarities
pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity.toarray())
The result is:
[[1.         0.17668795 0.27056873 0.         0.        ]
 [0.17668795 1.         0.15439436 0.         0.        ]
 [0.27056873 0.15439436 1.         0.19635649 0.16815247]
 [0.         0.         0.19635649 1.         0.54499756]
 [0.         0.         0.16815247 0.54499756 1.        ]]
However, when I reduce the corpus to just the first two documents:
corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away"]
and run the same code again, I get this matrix:
[[1. 0.19431434]
[0.19431434 1. ]]
So the similarity between the same two documents changes: it is 0.17668795 in the first matrix but 0.19431434 in the second. Why does the similarity between two documents depend on which other documents are in the corpus?
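To narrow this down, here is a minimal sketch that recomputes the two numbers by hand, without scikit-learn. It rests on several assumptions that are not stated in the answer I followed: that the default weighting is raw term count times a smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, and that the hand-written token lists below match what the vectorizer keeps after lowercasing and removing English stop words. The helper name `tfidf_cosine` is hypothetical.

```python
import math

def tfidf_cosine(docs, i, j):
    """Cosine similarity of docs[i] and docs[j] under smoothed TF-IDF weights."""
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # Smoothed IDF (assumed to match the TfidfVectorizer default).
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

    def weights(doc):
        # Raw term count times IDF for every distinct term in the document.
        return {t: doc.count(t) * idf[t] for t in set(doc)}

    def norm(w):
        return math.sqrt(sum(v * v for v in w.values()))

    a, b = weights(docs[i]), weights(docs[j])
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    return dot / (norm(a) * norm(b))

# Hand-tokenized documents (assumption: stop words and one-letter tokens removed).
full = [
    ["like", "apple"],
    ["apple", "day", "keeps", "doctor", "away"],
    ["compare", "apple", "orange"],
    ["prefer", "scikit", "learn", "orange"],
    ["scikit", "learn", "docs", "orange", "blue"],
]
pair = full[:2]

sim_full = tfidf_cosine(full, 0, 1)  # first two documents, five-document corpus
sim_pair = tfidf_cosine(pair, 0, 1)  # same two documents, two-document corpus
print(round(sim_full, 4), round(sim_pair, 4))
```

If those assumptions hold, the sketch gives roughly 0.1767 for the five-document corpus and 0.1943 for the two-document one. That suggests the difference comes from the IDF weights: both n and df change when documents are added or removed, so the same two documents get different vectors.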